Foreword



The objective of the workshops associated with the ER’99 18th International 
Conference on Conceptual Modeling is to give participants access to high level 
presentations on specialized, hot, or emerging scientific topics. Three themes 
have been selected in this respect: 

— Evolution and Change in Data Management (ECDM’99) dealing with han- 
dling the evolution of data and data structure, 

— Reverse Engineering in Information Systems (REIS’99) aimed at exploring 
the issues raised by legacy systems, 

— The World Wide Web and Conceptual Modeling (WWWCM’99) which ana- 
lyzes the mutual contribution of WWW resources and techniques with con- 
ceptual modeling. 

ER’99 has been organized so that there is no overlap between conference ses- 
sions and the workshops. Therefore participants can follow both the conference 
and the workshop presentations they are interested in. 

I would like to thank the ER’99 program co-chairs, Jacky Akoka and Mokrane 
Bouzeghoub for having given me the opportunity to organize these workshops. 
I would also like to thank Stephen Liddle for his valuable help in managing the 
evaluation procedure for submitted papers and helping to prepare the workshop 
proceedings for publication. 

August 1999 Jacques Kouloumdjian 

Preface for ECDM’99 

The first part of this volume contains the proceedings of the First International 
Workshop on Evolution and Change in Data Management, ECDM’99, which 
was held in conjunction with the 18th International Conference on Conceptual 
Modeling (ER’99) in Paris, France, November 15-18, 1999. 

The management of evolution and change and the ability of data, information, 
and knowledge-based systems to deal with change is an essential component in 
developing truly useful systems. Many approaches to handling evolution and 
change have been proposed in various areas of data management and ECDM’99 
has been successful in bringing together researchers from both more established 
areas and from emerging areas to look at this issue. This workshop dealt with 
the modeling of changing data, the manner in which change can be handled, and 
the semantics of evolving data and data structure in computer based systems. 

Following the acceptance of the idea of the workshop by the ER’99 com- 
mittee, an international and highly qualified program committee was assembled 
from research centers worldwide. As a result of the call for papers, the program
committee received 19 submissions from 17 countries. After rigorous refereeing,
11 high quality papers were chosen for presentation at the workshop; they
appear in these proceedings.



I would like to thank both the program committee members and the ad- 
ditional external referees for their timely expertise in reviewing the papers. I 
would also like to thank the ER’99 organizing committee for their support, and 
in particular Jacques Kouloumdjian, Mokrane Bouzeghoub, Jacky Akoka, and 
Stephen Liddle. 

August 1999 John F. Roddick 



Preface for REIS’99 

Reverse Engineering in Information Systems (in its broader sense) is receiving 
an increasing amount of interest from researchers and practitioners. This is not 
only the effect of some critical dates such as the year 2000. It derives simply 
from the fact that there is an ever-growing number of systems that need to be 
evolved in many different respects, and it is always a difficult task to handle 
this evolution properly. One of the main problems is gathering information on 
old, poorly documented systems prior to developing a new system. This analysis 
stage is crucial and may involve different approaches such as software and data 
analysis, data mining, statistical techniques, etc. 

An international program committee has been formed which I would like to 
thank for its timely and high quality evaluation of the submitted papers. 

The workshop has been organized in two sessions: 

1. Methodologies for Reverse Engineering. The techniques presented in this ses- 
sion are essentially aimed at obtaining information on data for finding data 
structures, eliciting generalization hierarchies, or documenting the system, 
the objective being to be able to build a high-level schema on data. 

2. System Migration. This session deals with the problems encountered when 
migrating a system into a new environment (for example changing its data 
model). One point which has not been explored much is the forecasting of the 
performance of a new system which is essential for mission-critical systems. 

I do hope that these sessions will raise fruitful discussions and exchanges.

August 1999 Jacques Kouloumdjian

Preface for WWWCM’99 

The purpose of the International Workshop on the World Wide Web and Con- 
ceptual Modeling (WWWCM’99) is to explore ways conceptual modeling can 
contribute to the state of the art in Web development, management, and use. 
We are pleased with the interest in WWWCM’99. Thirty-six papers were sub- 
mitted (one was withdrawn) and we accepted twelve. We express our gratitude 
to the authors and our program committee, who labored diligently to make the 
workshop program possible. The papers are organized into four sessions: 
1. Modeling Navigation and Interaction. These papers discuss modeling
interactive Web sites, augmentations beyond traditional modeling, and scenario
modeling for rapid prototyping as alternative ways to improve Web
navigation and interaction.

2. Directions in Web Modeling. The papers in this session describe three direc- 
tions to consider: (1) modeling superimposed information, (2) formalization 
of Web-application specifications, and (3) modeling by patterns. 

3. Modeling Web Information. These papers propose a conceptual-modeling 
approach to mediation, semantic access, and knowledge discovery as ways to 
improve our ability to locate and use information from the Web. 

4. Modeling Web Applications. The papers in this session discuss how
conceptual modeling can help us improve Web applications. The particular
applications addressed are knowledge organization, Web-based teaching
material, and data warehouses in e-commerce.

It is our hope that this workshop will attune developers and users to the benefits 
of using conceptual modeling to improve the Web and the experience of Web 
users. 

August 1999 Peter P. Chen, David W. Embley, and Stephen W. Liddle 

WWWCM’99 Web Site: www.cm99.byu.edu 
Abstract. Traditionally, product data and their evolving definitions have been
handled separately from process data and their evolving definitions. There is
little or no overlap between these two views of systems even though product
and process data are inextricably linked over the complete software lifecycle
from design to production. The integration of product and process models in a
unified data model provides the means by which data could be shared across an
enterprise throughout the lifecycle, even while that data continues to evolve. In 
integrating these domains, an object oriented approach to data modelling has 
been adopted by the CRISTAL (Cooperating Repositories and an Information 
System for Tracking Assembly Lifecycles) project. The model that has been 
developed is description-driven in nature in that it captures multiple layers of 
product and process definitions and it provides object persistence, flexibility, 
reusability, schema evolution and versioning of data elements. This paper 
describes the model that has been developed in CRISTAL and how descriptive 
meta-objects in that model have their persistence handled. It concludes that 
adopting a description-driven approach to modelling, aligned with a use of 
suitable object persistence, can lead to an integration of product and process 
models which is sufficiently flexible to cope with evolving data definitions. 



Keywords: Description-driven systems, modelling change, schema evolution, versioning



1. Introduction 

This study investigates how evolving data can be handled through the use of 
description-driven systems. Here description-driven systems are defined as systems in 
which the definition of the domain-specific configuration is captured in a computer- 
readable form and this definition is interpreted by applications in order to achieve the 
domain-specific goals. In a description-driven system definitions are separated from
instances and managed independently, to allow the definitions to be specified and to
evolve asynchronously from particular instantiations (and executions) of those
definitions. As a consequence a description-driven system requires computer-readable
models both for definitions and for instances. These models are loosely coupled and
coupling only takes place when instances are created or when a definition,
corresponding to existing instantiations, is modified. The coupling is loose since the
lifecycle of each instantiation is independent from the lifecycle of its corresponding 
definition. 
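
The separation described above can be illustrated with a minimal Python sketch. It
is not CRISTAL code: the names Definition, Instance, instantiate and evolve, and the
example fields, are assumptions introduced here purely for illustration. The only
coupling point is instance creation; a definition can then evolve without touching
instances that already exist.

```python
from dataclasses import dataclass, field


@dataclass
class Definition:
    """A computer-readable description, managed independently of instances."""
    name: str
    version: int = 1
    fields: dict = field(default_factory=dict)  # field name -> default value

    def evolve(self, **changes):
        """Return a new version of this definition; the old one is left untouched."""
        return Definition(self.name, self.version + 1, {**self.fields, **changes})


@dataclass
class Instance:
    """An instantiation; coupled to its definition only at creation time."""
    definition: Definition
    values: dict


def instantiate(definition: Definition, **overrides) -> Instance:
    # Coupling point: the definition is interpreted once, when the instance is created.
    return Instance(definition, {**definition.fields, **overrides})


part_v1 = Definition("PartDefinition", fields={"mass_g": 0.0})
crystal = instantiate(part_v1, mass_g=12.5)

# The definition evolves asynchronously; the existing instance is unaffected.
part_v2 = part_v1.evolve(length_mm=0.0)
newer = instantiate(part_v2, mass_g=13.0)
```

In this sketch the instance created from version 1 keeps its reference to that
version, so both versions of the definition remain usable side by side.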

Description-driven systems (sometimes referred to as meta-systems) are 
acknowledged to be flexible and to provide many powerful features including (see [1], 
[2], [3]): Reusability, Complexity handling, Version handling, System evolution and
Interoperability. This paper introduces the concept of description-driven systems in
the context of a specific application (the CRISTAL project), discusses other related
work in handling schema and data evolution, relates this to more familiar multi-layer 
architectures and describes how multi-layer architectures allow the above features to 
be realised. It then considers how description-driven systems can be implemented 
through the use of meta-objects and how these structures enable the handling of 
evolving data. 



2. Related Work 

This section places the current work in the context of other schema evolution work. 
A comprehensive overview of the state of the art in schema evolution research is
given in [4]. Two approaches to implementing a schema change are presented.
The internal schema evolution approach uses schema update primitives in changing 
the schema. The external schema evolution approach gives the schema designers the 
flexibility of manually changing the schema using a schema-dump text and importing 
the change onto the system later. 

A schema change affects other parts of the schema, the object instances in the 
underlying databases, and the application programs using these databases. As far as 
object instances are concerned, there are two strategies for carrying out the changes.
The immediate conversion strategy converts all object instances at once. The deferred
conversion strategy takes note of the affected instances and the way they have to be
changed, and conversion is done when the instance is accessed. None of the strategies
developed for handling schema evolution complexity is suitable for all application
fields, because requirements differ with respect to permanent database availability,
real-time behaviour and space limitations. Moreover, existing support for schema
evolution in current OODB systems (O2, GemStone, Itasca, ObjectStore) is limited to
a pre-defined set of schema evolution operations with fixed semantics [5]. CRISTAL
adheres to a modified deferred conversion strategy. Changes to the production schema 
are made available upon request of the latest production release. 
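
The difference between the two conversion strategies can be sketched in a few lines
of Python. This is a toy model under stated assumptions, not the modified deferred
strategy used in CRISTAL: the Store class, its method names and the conversion
counter are hypothetical and exist only to make the behaviour observable.

```python
class Store:
    """Toy object store illustrating immediate vs. deferred conversion."""

    def __init__(self):
        self.schema_version = 1
        self.objects = {}      # oid -> (schema version, data dict)
        self.converters = {}   # from-version -> function upgrading data by one version
        self.conversions = 0   # counts conversions actually performed

    def put(self, oid, data):
        self.objects[oid] = (self.schema_version, dict(data))

    def change_schema(self, converter, immediate=False):
        self.converters[self.schema_version] = converter
        self.schema_version += 1
        if immediate:
            # Immediate strategy: convert every stored instance now.
            for oid in list(self.objects):
                self._upgrade(oid)

    def _upgrade(self, oid):
        version, data = self.objects[oid]
        while version < self.schema_version:
            data = self.converters[version](data)
            version += 1
            self.conversions += 1
        self.objects[oid] = (version, data)
        return data

    def get(self, oid):
        # Deferred strategy: convert only when the instance is accessed.
        return self._upgrade(oid)


store = Store()
store.put("a", {"mass": 1.0})
store.put("b", {"mass": 2.0})
store.change_schema(lambda d: {**d, "unit": "g"})  # deferred by default
pending = store.conversions                        # nothing has been converted yet
a = store.get("a")                                 # converts only instance "a"
```

With the deferred strategy the cost of a schema change is paid per instance on first
access, which keeps the store available during the change at the price of keeping
the converters around.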

Several novel ideas have been put forward in handling schema evolution. Graph 
theory is proposed as a solution in detecting and incorporating schema change without
affecting underlying databases and applications [6] and some of these ideas have been 
folded into the current work. An integrated schema evolution and view support system 
is presented in [7]. In CRISTAL there are two types of schemas - the user’s personal 
view and the global schema. A user’s change in the schema is applied to the user’s 
personal view schema instead of the global schema. The persistent data is shared by 
multiple views of the schema. 
To cope with the growing complexity of schema update operations in most current
applications, high level primitives are being provided to allow more complex and 
advanced schema changes [8]. Along the same line, the SERF framework [5] succeeds 
in giving users the flexibility to define the semantics of their choice and the 
extensibility of defining new complex transformations. SERF claims to provide the 
first extensible schema evolution framework. Current systems supporting schema 
evolution concentrate on changes local to individual types. The Type Evolution 
Software System (TESS) [8] allows for both local and compound changes involving 
multiple types. The old and the new schema are compared, a transformation function 
is produced and updates the underlying databases. The need for an evolution taxonomy
involving relationships is very much evident in OO systems. A set of basic evolution
primitives for uni-directional and bi-directional relationships is presented in [9]. No 
work has been done on schema evolution of an object model with relationships prior 
to this research. 

Another approach to handling schema updates is schema versioning. The schema
versioning approach takes a copy of the schema and modifies the copy, thus creating a
new schema version. The CRISTAL schema versioning follows this principle. In
CRISTAL, schema changes are realised through dynamic production re-configuration. 
The process and product definitions, as well as the semantics linking a process 
definition to a product definition, are versionable. Different versions of the production 
co-exist and are made available to the rest of the CRISTAL users. 
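
The copy-and-modify principle of schema versioning can be sketched as follows. The
SchemaVersions class and the dictionary representation of a schema are assumptions
made for this illustration; the sketch shows only the general principle, not the
CRISTAL production re-configuration mechanism.

```python
import copy


class SchemaVersions:
    """Keeps every schema version; a change copies the latest and edits the copy."""

    def __init__(self, initial):
        self._versions = [copy.deepcopy(initial)]

    @property
    def latest(self):
        return len(self._versions) - 1

    def get(self, version):
        return self._versions[version]

    def derive(self, modify):
        """Copy the latest schema, apply `modify` to the copy, store it as a new version."""
        new_schema = copy.deepcopy(self._versions[-1])
        modify(new_schema)
        self._versions.append(new_schema)
        return self.latest


schemas = SchemaVersions({"Part": ["id", "mass"]})
v1 = schemas.derive(lambda s: s["Part"].append("length"))
```

Because every change works on a deep copy, earlier versions stay intact and can
co-exist with the new one.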

An integration of the schema versioning approach and the more common 
adaptational approach is given in [10]. An adaptational approach, also called direct 
schema evolution approach, always updates the schema and the database in place 
using schema update primitives and conversion functions. The work in [10] is the first
to integrate both approaches into a general schema evolution
model.

3. The CRISTAL Project 

A prototype has been developed which allows the study of data evolution in 
product and process modelling. The objective of this prototype is to integrate a 
Product Data Management (PDM) model with a Workflow Management (WfM) 
model in the context of the CRISTAL (Cooperating Repositories and Information 
System for Tracking Assembly Lifecycles) project currently being undertaken at 
CERN, the European Centre for Particle Physics in Geneva, Switzerland. See [11, 12, 
13, 14]. 

The design of the CRISTAL prototype was dictated by the requirements for 
adaptability over extended timescales, for schema evolution, for interoperability and 
for complexity handling and reusability. In adopting a description-driven design 
approach to address these requirements, a separation of object instances from object 
descriptions instances was needed. This abstraction resulted in the delivery of a meta- 
model as well as a model for CRISTAL. 

The CRISTAL meta-model is comprised of so-called ‘meta-objects’ each of which
is defined for a class of significance in the data model: e.g. part definitions for parts,
activity definitions for activities, and executor definitions for executors (e.g.
instruments, automatically-launched code etc.).
Figure 1. Subset of the CRISTAL meta-model.



Figure 1 shows a subset of the CRISTAL meta-model. In the model information is
stored at specification time for types of parts or part definitions and at assembly time 
for individual instantiations of part definitions. At the design stage of the project 
information is stored against the definition object and only when the project 
progresses is information stored on an individual part basis. This meta-object 
approach reduces system complexity by promoting object reuse and translating 
complex hierarchies of object instances into (directed acyclic) graphs of object 
definitions. Meta-objects allow the capture of knowledge (about the objects) alongside
the objects themselves, enriching the model and facilitating self-description and data
independence. It is believed that the use of meta-objects provides the flexibility 
needed to cope with their evolution over the extended timescales of CRISTAL 
production. Discussion on how the use of meta-objects provides the flexibility needed 
to cope with evolving data in CRISTAL through the so-called Production Scheme 
Definition Handler (PSDH) is detailed in a separate section of this paper. 

Figure 2 shows the software architecture for the definition of a production scheme 
in the CRISTAL Central System. It is composed of a Desktop Control Panel (DCP) 
for the Coordinator, a GUI for creating and updating production specifications and the 
PSDH for handling persistency and versioning of the definition objects in the 
Configuration Database. 




Figure 2. Software architecture of definition handler in Central System. 
Figure 3. A 4-layer meta-modeling architecture.



4. The Layered Architecture of Description Driven Systems 

The concept of models which describe other models has come to be known as 
‘meta-models’ and is gaining wide acceptance in the world of object-oriented design.
One example of a system which uses a multi-layer architecture is that of a WfM 
system [15]. In WfM systems the workflow instances (such as activities or tasks) 
correspond to the lowest level of abstraction - the instance layer. In order to instantiate 
the workflow objects a workflow scheme is required. This scheme describes the 
workflow instances and corresponds to the next layer of abstraction - the model layer. 
The information about a model is generally described as meta-data. In order for the 
workflow scheme itself to be built, a further model is required to capture/hold the 
semantics for the generation of the workflow scheme. This model (i.e. a model 
describing another model) is the next layer of abstraction - the so-called meta-model 
layer. In other words a meta-model is simply an abstraction of meta-data [16]. 

The semantics required to adequately model the information in the application 
domain of interest will in most cases be different. What is required for integration and 
exchange of various meta-models is a universal type language capable of describing 
all meta-information. The common approach is to define an abstract language, which 
is capable of defining another language for specifying a particular meta-model, in 
other words meta-meta-information (cf. [17]). In this manner it is possible to have a
number of meta-model layers. The generally accepted conceptual framework for meta 
modeling is based on an architecture with four layers [18]. Figure 3 illustrates the four 
layer meta-modeling architecture adopted by the OMG and based on the ISO 11179 
standard. 

The meta-meta-model layer is the layer responsible for defining a general modeling 
language for specifying meta-models. This top layer is the most abstract and must 
have the capability of modeling any meta-model. It comprises the design artifacts in 
common to any meta-model. At the next layer down a (domain specific) meta-model is 
an instance of a meta-meta-model. It is the responsibility of this layer to define a 
language for specifying models, which is itself defined in terms of the meta-meta types 
(such as meta-class, meta-relationship, etc.) of the meta-meta modeling layer above. 
Examples from manufacturing of objects at this level include workflow process 
description, nested subprocess description and product descriptions. A model at layer
two is an instance of a meta-model. The primary responsibility of the model layer is to
define a language that describes a particular information domain. Example objects
for the manufacturing domain would be product, measurement, production schedule and
composite product. At the lowest level, user objects are instances of a model and
describe a specific information and application domain.
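
The instance-of chain running through the four layers can be made concrete with a
small Python sketch. The Element class, the layer function and the element called
"crystal #0042" are hypothetical names chosen for illustration; the other element
names follow the manufacturing examples given above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    name: str
    instance_of: Optional["Element"] = None  # link to the element one layer above


LAYER_NAMES = ["meta-meta-model", "meta-model", "model", "instance"]


def layer(element: Element) -> str:
    """Name the layer of an element by counting instance-of links up to the top."""
    depth = 0
    while element.instance_of is not None:
        element = element.instance_of
        depth += 1
    return LAYER_NAMES[depth]


meta_class = Element("MetaClass")                                # meta-meta-model layer
product_description = Element("ProductDescription", meta_class)  # meta-model layer
product = Element("Product", product_description)                # model layer
crystal_0042 = Element("crystal #0042", product)                 # user object
```

The sketch assumes exactly four layers, as in Figure 3; an element nested more deeply
than that would fall outside LAYER_NAMES.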

5. Features of Description Driven Systems 

It is the basic tenet of this study that the desirable features of description-driven 
systems can be realised through the adoption of a flexible multi-layered system 
architecture. This section examines each required feature in turn and explains how a 
multi-layer architecture facilitates those features. 

• Reusability. It is a natural consequence of separating definition from instantiation 
in a system that reusability is promoted. Each definition can be instantiated many 
times and can be reused for multiple applications. 

• Complexity handling (scalability). As systems grow in complexity it becomes 
increasingly necessary to capture descriptions of system elements, rather than 
capturing detail associated with each individual instantiation of an element, to 
alleviate data management. Scalability can be eased, and a potential explosion in the 
number of products to be managed can be avoided, if descriptive information is held 
both at the model and meta-model layers of a multi-layer architecture and, in addition, 
if information is captured about the mechanism for the instantiation of objects at a 
particular level. In a multi-layer architecture, as abstraction from instance to model to 
meta-model is followed, there are fewer data and types to manage at each layer but 
more semantics must be specified so that system complexity and flexibility can be 
simultaneously catered for. 

• Version handling. It is natural for systems to change over time - new elements are 
specified, existing elements are amended and some are deleted. Element descriptions 
can also be subject to change over time. Separating description from instantiation 
allows new versions of elements (or element descriptions) to coexist with older 
versions that have been previously instantiated. 

• System evolution. When descriptions move from one version to the next the 
underlying system should cater for this evolution. However, existing production
management systems, as used in industry, cannot cater for this. In capturing 
description separate from instantiation, using a multi-layer architecture, it is possible 
for system evolution to be catered for while production is underway and therefore to 
provide continuity in the production process and for design changes to be reflected 
quickly into production. 

• Interoperability. A fundamental requirement in making two distributed systems 
interoperate is that their software components can communicate and exchange data. In 
order to interoperate and to adapt to reconfigurations and versions, large scale systems 
should become ‘self describing’. It is desirable for systems to be able to retain 
knowledge about their dynamic structure and for this knowledge to be available to the 
rest of the distributed infrastructure through the way that the system is plugged 
together. This is absolutely critical and necessary for the next generation of distributed
systems to be able to cope with size and complexity explosions. 



6. Repository-Based Handling of Description 

The CRISTAL system is composed of a Central System whose purpose is to control 
the overall production process and several Local Centres at which production 
(assembly and testing) is carried out. The overall system Coordinator at the Central 
System defines the design or configuration of the different detector components, 
called the Detector Production Scheme (DPS). The Coordinator is responsible for 
supplying the production scheme to the different Local Centres. The local Operator at 
each Local Centre applies the production scheme to the local production line. 

The design of the detector changes over time and many versions of the production
exist. The DPS object stores the production scheme version number, and the list of
definition objects used in that version. DPS is also versionable, thus creating a linear
production scheme version genealogy, storing the history of the different definitions
and the history of the different production schemes. Changes to the definitions are 
allowed. This is done centrally and made available to the rest of the system via the 
next supply of the configuration by the Coordinator. After each supply, the 
Coordinator has the option to start with an empty production scheme or from the last 
supplied production scheme. The supply automatically creates a new version of the 
DPS object, and attaches it to the end of the version genealogy.
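
The supply cycle described above can be sketched as follows. This is an illustrative
model only: the Coordinator and DPS classes, the attribute names and the integer
OIDs are assumptions, and the sketch omits persistence and distribution to the Local
Centres.

```python
from dataclasses import dataclass, field


@dataclass
class DPS:
    version: int            # production scheme version number
    definition_oids: list   # definition objects used in that version


@dataclass
class Coordinator:
    genealogy: list = field(default_factory=list)  # supplied DPS versions, oldest first
    working: list = field(default_factory=list)    # OIDs in the scheme being edited

    def supply(self, start_empty=False):
        """Freeze the working scheme as a new DPS version at the end of the genealogy."""
        dps = DPS(version=len(self.genealogy) + 1, definition_oids=list(self.working))
        self.genealogy.append(dps)
        # After a supply, continue from the supplied scheme or from an empty one.
        if start_empty:
            self.working = []
        return dps


coord = Coordinator()
coord.working += [101, 102]
first = coord.supply()                   # next scheme starts from the supplied one
coord.working.append(103)
second = coord.supply(start_empty=True)  # next scheme starts empty
```

Each supply appends exactly one DPS to the list, so the genealogy stays linear and
earlier schemes are never modified.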

Each definition object is divided into two parts - definition attributes (not 
versioned) and definition properties (versioned). New definition properties are 
created whenever the definition object is updated for the first time in a new production 
scheme. The version table, which tracks which production scheme version a property 
has been created for, is part of the attributes. Figure 4 shows the organisation of the 
different versions of persistent CRISTAL objects. 




Figure 4. Organisation of persistent CRISTAL definition objects. 

The persistent definition objects are stored using Objectivity. The PSDH is 
responsible for handling the production scheme versions and the definition versions. 
The PSDH is implemented using C++ and is a CORBA service layered on top of the
DB, providing the different Coordinator functionalities. Each definition object is 
assigned a unique identification number (OID) generated by the PSDH. To access a 
particular definition object, the OID is given, plus the production scheme version of 
the property. The version table is scanned to locate the correct property, if it exists, 
and the full definition is then loaded into the user interface (DCP) of the Coordinator. 
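
A minimal sketch of the lookup described above is given here in Python (the PSDH
itself is a C++ CORBA service, which this sketch does not attempt to reproduce). The
class and method names are hypothetical. One behaviour is an explicit assumption: when
no property was created for the requested production scheme version, the sketch
returns the most recent property created for an earlier version.

```python
import itertools


class DefinitionObject:
    def __init__(self, oid, name):
        # Attributes: not versioned. The version table is part of them.
        self.oid = oid
        self.name = name
        self.version_table = []  # scheme versions for which properties exist, ascending
        self._properties = {}    # scheme version -> property dict (the versioned part)

    def update(self, scheme_version, **changes):
        if scheme_version not in self._properties:
            # First update in this production scheme: create new properties,
            # starting from the most recent earlier ones (assumption).
            base = self.lookup(scheme_version) or {}
            self._properties[scheme_version] = dict(base)
            self.version_table.append(scheme_version)
            self.version_table.sort()
        self._properties[scheme_version].update(changes)

    def lookup(self, scheme_version):
        """Scan the version table for the properties valid in `scheme_version`."""
        found = None
        for v in self.version_table:
            if v <= scheme_version:
                found = v
        return self._properties[found] if found is not None else None


class PSDHSketch:
    """Hypothetical stand-in for the handler: generates OIDs and resolves versions."""

    def __init__(self):
        self._next_oid = itertools.count(1)
        self._objects = {}

    def create(self, name):
        oid = next(self._next_oid)
        self._objects[oid] = DefinitionObject(oid, name)
        return oid

    def update(self, oid, scheme_version, **changes):
        self._objects[oid].update(scheme_version, **changes)

    def load(self, oid, scheme_version):
        # Access takes the OID plus the production scheme version of the property.
        return self._objects[oid].lookup(scheme_version)


psdh = PSDHSketch()
oid = psdh.create("PartDefinition")
psdh.update(oid, 1, mass_g=12.5)
psdh.update(oid, 3, mass_g=13.0, length_mm=230)
```

Here a load for scheme version 2 falls back to the properties created for version 1,
and a load for a version earlier than any property returns None, matching the "if it
exists" condition in the text.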






Figure 5. Subset of CRISTAL Configuration Architecture (diagram labels: Architectural
Components, Meta-Model Layer, Model Layer)

The PSDH and its relation to the different layers of the CRISTAL architecture is 
shown in Figure 5. The CRISTAL meta-model is an abstraction of the CRISTAL 
model. It captures and holds the semantics for the generation of the CRISTAL model. 
It describes the elements for overall production scheme management and embodies the 
adopted versioning policies in CRISTAL. The CRISTAL model is composed of the 
definition classes and the associations between these classes. The PSDH provides the 
mechanisms for production specification management. It describes and manages the 
complete lifecycle of all CRISTAL definitions used to specify the design of a sub- 
detector. 

The PSDH allows for the co-existence of many versions of the production scheme. 
It caters for dynamic design upgrade, that is, the creation of new detector components 
or changes in existing detector specification, without the need for the production line 
to stop so that a new scheme can be uploaded. Likewise, the history of the versions of 
the configuration is stored in the database, allowing users to access historical data. 
Consequently, evolving design knowledge is managed transparently without the need 
for the production line to be flushed. 

7. Conclusions 

It is apparent that the description-driven approach, advocated in this paper, reduces 
system complexity by promoting object reuse and by translating complex hierarchies 
of object instances into graphs of object definitions. A meta-model of the schema can 
be stored in the database, which describes the actual objects and allows changes to be 
made without the need to alter the database schema itself. This also makes it possible 
to store different versions of objects concurrently in the same database. A model can 
be derived from this meta-model which is sufficient to perform the PDM-WfMS 
integration. The CRISTAL system has demonstrated that evolving data and
persistence can be handled using a description driven system approach. 

The experience of using meta models and meta objects at the analysis and design 
phase in the CRISTAL project has been very positive. Designing the meta model 
separately from the runtime model has allowed the design team to provide consistent 
solutions to dynamic change and versioning. The object models are described using 
UML [19] which itself can be described by the OMG Meta Object Facility [20] and is 
the candidate choice by OMG for describing all business models. 

In distributed object-based systems, object request brokers, such as the Object 
Management Group’s CORBA [21] provide for the exchange of simple data types 
and, in addition, provide location and access services. The CORBA standard is meant 
to standardise how systems interoperate. OMG’s CORBA Services specify how 
distributed objects should participate and provide services such as naming, persistent 
storage, lifecycle, transaction, relationship and query. The CORBA Services standard 
is an example of how self-describing software components can interact to provide 
interoperable systems. 

The current phase of CRISTAL research aims to adopt an open architectural 
approach, based on a meta-model and an extraction facility to produce an adaptable 
system capable of interoperating with future systems and of supporting views onto an 
engineering data warehouse. The meta-model approach to design reduces system 
complexity, provides model flexibility and can integrate multiple, potentially 
heterogeneous, databases into the enterprise-wide data warehouse. A first prototype 
for CRISTAL based on CORBA, Java and Objectivity technologies was deployed
in the autumn of 1998 [22]. The second phase of research will culminate in the 
delivery of a production system in 1999 supporting queries onto the meta-model and 
the definition, capture and extraction of data according to user-defined viewpoints. 

Recently a considerable amount of interest has been generated in meta-models and 
meta-object description languages. Work has been completed within the OMG on the 
Meta Object Facility which is expected to manage all meta-models relevant to the 
OMG Architecture. The purpose of the OMG MOF is to provide a set of CORBA 
interfaces that can be used to define and manipulate a set of interoperable meta 
models. The MOF is a key component in the CORBA Architecture as well as the 
Common Facilities Architecture. The MOF uses CORBA interfaces for creating,
deleting and manipulating meta objects and for exchanging meta models. The intention is
that the meta-meta objects defined in the MOF will provide a general modelling 
language capable of specifying a diverse range of meta models (although the initial 
focus was on specifying meta models in the Object Oriented Analysis and Design 
domain). The next phase of CRISTAL research intends to help verify this. 
Abstract. The increasing importance of better customisation of industrial
products has led to the development of configurable products. They allow companies
to provide product families with a large number, typically millions, of variants.
Description and management of a large product variety within a single data 
model is challenging and requires a solid conceptual basis. This management 
differs from the schema evolution of traditional databases and the crucial dis- 
tinction is the role of schema. For configurable products, the schema changes 
more frequently and more radically. In this paper, the characteristics that distin- 
guish configurable products from traditional data modelling and management 
are investigated taking into account both the evolution of the schema and the in- 
stances. In this respect, the existing data modelling approaches and product data 
management systems are inadequate. Therefore, a new conceptual framework 
for product data management systems of configurable products that allows rela- 
tively independent evolution of schema and instances is proposed. 



1 Introduction 

To satisfy the demands of individual customers, companies need to provide a larger 
variety of products. Products, or product families, that allow large variation in a rou- 
tine manner are called configurable products. The routine adaptation necessitates that 
the product family is pre-designed to cover a variety of situations. One challenge is to 
find the correct concepts for modelling configurable products so that the descriptions 
can be kept up to date as the product evolves. The challenge has not been adequately 
answered by the research in product configuration [1,2], product modelling standards 
[3, 4] or commercial product data management (PDM) systems [see, e.g., 5]. 

In the following, we argue that a major reason for the inadequacy of these solutions is 
the problematic role of the schema of configurable products. From the product modeller’s 
viewpoint, modelling of configurable products involves the following levels [3, 4]: 

- The basic concepts of the chosen data modelling method. For example, ‘class’, 
‘attribute’, ‘classification’, ‘inheritance’ and so on. 

- The product modelling concepts described using the basic concepts. Examples 
include ‘product’, ‘has-part relation’, ‘optional part’ and ‘alternative parts’ and 
conditions, such as ‘incompatibility of components’. 

- Product family descriptions formed using the product modelling concepts. These 
descriptions are also called configuration models. 

- Finally, product individuals are instantiated according to the configuration model. 

From an ontological viewpoint, the first two correspond to top-level ontology and domain 
ontology, respectively [6], and the latter two are an application of the domain ontology 
to a specific case. Data modelling, however, typically differs from the above view 
by modelling the world using the levels of concepts, schema and individuals. 

- Concepts are the principal data modelling elements for describing the world. 

- Schema is the actual model of the world; it defines the entity types of the world. 

- Individuals constitute the actual population of data that can be queried and manipu- 
lated. Individuals in a database represent the individuals of the world that the 
schema describes. 

Configuration modelling does not fit nicely into the data modelling view for various 
reasons. Firstly, configuration models need new concepts, such as ‘has-part relation 
with alternative parts’, not available as traditional data modelling concepts. 

Secondly, as companies develop their products, product family descriptions are 
constantly changed. Product family descriptions, however, conceptually correspond to 
a traditional schema. In traditional database schema evolution, the existing data is 
typically converted to reflect the changes in the schema. In product development, this 
is not the case; old product individuals are typically not converted as the product is 
developed. The evolution of a schema in product modelling, therefore, significantly 
differs from traditional schema evolution. 

Thirdly, individuals of configurable products have long lifetimes and histories of 
their own. More importantly, they do evolve independently of the schema since customers 
may change their product individuals as they please. This is a crucial difference 
from traditional databases since the evolution of individuals is not reflected in the 
schema and consequently, the individuals do not necessarily conform to the schema. 
Modelling such individuals necessitates flexibility, for example, relaxing the strict 
conformance of individuals to the schema. 

Therefore, in this paper we search for a suitable mechanism for modelling the evo- 
lution of configurable products. The mechanism should capture both the evolution of 
the schema and the individuals for an environment in which the conformance of indi- 
viduals to the schema is not constantly maintained. 



2 Previous Work 

The basic goal in product configuration modelling is to provide means for describing 
a large set of product variants by a single data model. Methods of artificial intelli- 
gence have been extensively used for modelling the conditions that define the valid 
combinations of components and for finding a solution for particular requirements [1, 
2]. There are also approaches for representing product variety based on the product 
structure, such as generic Bills-of-Materials or generic product structures [see, e.g., 
7], and approaches that emphasise product structure and classification in configuration 
modelling [8, 9]. Although the difficulties in managing configuration models are 
widely recognised, the vast majority of the models ignore all aspects related to the 
evolution of configuration models and configurations. In this paper, we aim at provid- 
ing the concepts for capturing such evolution. 

Version is a concept for describing the evolution of products [10, 11]. Versions 
typically represent the evolution of a generic object that, in turn, has a set of versions. 
A generic object is sometimes also called a generic version or generic instance [12, 
13, 14, 15]. There can be two kinds of component references to versions: statically 
bound and dynamically bound [13]. A statically bound reference specifies explicitly a 
component and one of its versions. A dynamic component reference, i.e., a compo- 
nent reference to a generic object, can be bound to a specific version at various points 
in time. The idea is that in a given context, one of the versions from the set is a repre- 
sentative for the generic object, that is, the one to which dynamic references are 
bound. In the following, we utilise the version set approach, but for both the schema 
and the individuals in parallel. 
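
The difference between statically and dynamically bound references can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a construct from the cited work: the class names `GenericObject`, `StaticRef` and `DynamicRef` are hypothetical, and the rule that the most recently added version acts as the representative is an assumed policy.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GenericObject:
    """A generic object owning a set of versions (hypothetical structure)."""
    name: str
    versions: Dict[str, dict] = field(default_factory=dict)
    current: Optional[str] = None  # id of the representative version

    def add_version(self, version_id: str, data: dict) -> None:
        self.versions[version_id] = data
        # Assumed policy: the newest version becomes the representative.
        self.current = version_id


@dataclass
class StaticRef:
    """Statically bound: names the component and one fixed version."""
    target: GenericObject
    version_id: str

    def resolve(self) -> dict:
        return self.target.versions[self.version_id]


@dataclass
class DynamicRef:
    """Dynamically bound: names only the generic object."""
    target: GenericObject

    def resolve(self) -> dict:
        # Bound at resolution time to whichever version is the representative.
        return self.target.versions[self.target.current]


frame = GenericObject("frame")
frame.add_version("v1", {"material": "steel"})
static_ref = StaticRef(frame, "v1")
dynamic_ref = DynamicRef(frame)
frame.add_version("v2", {"material": "aluminium"})
```

After the second version is added, `static_ref.resolve()` still yields the first version, whereas `dynamic_ref.resolve()` follows the generic object to its new representative.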

Schema evolution is a relevant issue for databases. The approaches to schema evolution 
can be classified into filtering, persistent screening and conversion [16]. In filtering, 
a database manager needs to provide filters so that an operation defined on a 
particular type version can be applied to any instance of any version of the type [17]. 
Persistent screening defines how the instances should be modified to accommodate a 
schema modification, but the actual conversion takes place only when an instance is 
accessed for the first time [18]. Conversion, on the other hand, modifies all instances 
right after the schema modification [16]. 
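
The three approaches differ mainly in when, and whether, a stored instance is rewritten. The following Python sketch contrasts them on a toy example. It is a simplification under stated assumptions: the schema change (`upgrade_v1_to_v2`, which splits a `name` field in two), the dictionary representation of instances and all function names are hypothetical and do not come from the cited systems.

```python
from typing import List


def upgrade_v1_to_v2(inst: dict) -> dict:
    """Hypothetical schema change: version 2 splits 'name' into two fields."""
    first, _, last = inst["name"].partition(" ")
    return {"schema": 2, "first": first, "last": last}


def convert_all(instances: List[dict]) -> None:
    """Conversion: rewrite every instance right after the schema change."""
    for i, inst in enumerate(instances):
        if inst["schema"] == 1:
            instances[i] = upgrade_v1_to_v2(inst)


def access_screened(instances: List[dict], i: int) -> dict:
    """Persistent screening: convert on first access and store the result."""
    if instances[i]["schema"] == 1:
        instances[i] = upgrade_v1_to_v2(instances[i])
    return instances[i]


def read_filtered(inst: dict) -> dict:
    """Filtering: the stored instance is left alone and adapted per operation."""
    return upgrade_v1_to_v2(inst) if inst["schema"] == 1 else inst


eager = [{"schema": 1, "name": "Ada Lovelace"}, {"schema": 1, "name": "Alan Turing"}]
convert_all(eager)  # every instance is now at schema version 2

lazy = [{"schema": 1, "name": "Ada Lovelace"}, {"schema": 1, "name": "Alan Turing"}]
first_read = access_screened(lazy, 0)  # only the accessed instance is converted

stored = {"schema": 1, "name": "Ada Lovelace"}
view = read_filtered(stored)  # 'stored' itself keeps schema version 1
```

All three variants presuppose that a meaningful upgrade function exists, which is the assumption questioned below for product individuals.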

A principal assumption in database schema evolution is that instances can be con- 
verted. In product configuration, however, this is not always possible. For example, 
product development may have obsoleted some components and thus removed them 
from a new configuration model. Therefore, the old product instances that were con- 
figured and manufactured according to an old configuration model are not represented 
by the new model. Thus, automatic conversion of old product individuals to the new 
schema is not meaningful. Because individuals cannot be converted, it is natural to 
have individuals from different versions of schema. 



3 Evolution of Schema and Individuals 

In this section, we position this work by identifying four basic cases according to 
whether the evolution of the schema and/or the individuals is supported. By supporting 
evolution, we mean here that the history or consequences of a change are recorded 
in a more advanced form than merely allowing modifications that leave no 
trace of the order in which they were made. 

3.1 Category 1: No Support for Evolution 

No support for evolution means that there is no memory or history of the schema or 
the individuals. Typically, the schema is assumed static and the individuals are mutated 
“in place”. These mutations are typically required to be such that the individuals 
are always (outside transactions) consistent with respect to the schema. 

As an example, assume a product ‘X-Bike’ with two models, ‘Standard X’ and ‘Deluxe X’. In 
the schema, ‘Standard X’ and ‘Deluxe X’ would be subtypes of ‘X-Bike’. Typically, 
individuals would be created only as instances of ‘Standard X’ and ‘Deluxe X’, for 
example, ‘Deluxe X, serial #1183’. 



3.2 Category 2: No Schema Evolution, but Individuals Evolve 

Although most databases do not have any concept of history, some databases store the 
history of individuals. With history stored, one can go back in time and see how 
things were at some point. For example, one can find the description of a product 
individual at the time it was delivered. With more advanced temporal querying capabilities, 
information can be retrieved using temporal relations. Nevertheless, in this 
category the schema is assumed stable although the evolution of individuals is sup- 
ported. 

With respect to the ‘X-Bike’, individuals such as ‘Deluxe X, serial #1183’ would 
have multiple data representations. For example, the ‘as-manufactured version’ of ‘Deluxe 
X, serial #1183’ would record the serial numbers of its critical parts, e.g., the 
frame, and the customer-specific options installed. Thereafter, all service operations are 
carefully recorded (this is a deluxe bike!). For example, as part of the regular service 
on March 15, 1999, the handlebar was changed to a new model ‘Sporty W’. 



3.3 Category 3: Schema Evolves, Individuals Do Not 

Evolution of schema, when supported in a database system, typically propagates the 
modifications of the schema to instances as was discussed earlier. 

For example, in 1999, a new mountain bike model ‘Bike XM’ is introduced for 
‘Bike X’. The old models are still manufactured, although some components are re- 
placed due to problems in durability and some due to the change of a subcontractor, 
both leading to new versions of ‘Bike XM’. With some extra work, as-manufactured 
information can perhaps be found, but no explicit records of the individuals are kept. 



3.4 Category 4: Both Schema and Individuals Evolve 

This is the category most relevant to configuration modelling. This is actually what 
happens in the real world. The schema, i.e., the description of a product family, 
evolves as products are being developed. The developments in products cannot be 
propagated to the individuals automatically, as was discussed above. The schema 
evolves in this respect independently of the individuals. Customers, on the other hand, 
can modify their product individuals practically as they please, that is, very much 
independently of the schema. Although total freedom cannot be systematically supported, 
the relation of the schema and individuals should be maintained as far as possible. 

For ‘Bike X’ this means that product families are modified and created, partly util- 
ising the same components. The individual bikes are manufactured according to the 
product family descriptions, and the modifications to the individuals are recorded. 

Many models lack the support for the evolution of the schema and individuals be- 
cause of conceptual or practical simplifications rather than because it would not be 
needed. Especially in product data management, traceability and after-sales services 
need a history of the schema and individuals. If the evolution is not supported by the 
database management systems, the mechanism must be implemented on top of the 
existing system. The goal of this paper is to find solutions that could be implemented 
in PDM systems of configurable products. The important questions are “what does 
versioning of schema and individuals mean?” and “how do they relate?” 

Fig. 1. Different approaches with respect to support for evolution of schema and individuals 

In Fig. 1, the support for evolution of schema is shown on the horizontal axis, 
whereas the vertical axis shows the support for individuals. These axes generate a grid 
for categories 1-4, into which different approaches are placed. 



4 Conceptual Model 

It is impossible to discuss the evolution of schema and individuals unless there is 
some kind of agreement on their contents. Thus, we need to define some concepts. 
However, in order to keep a clear focus, there should be as few concepts as possible, 
and yet the concepts should capture the essential characteristics of configurable products. 
The objective of this paper is not to define the semantics of the concepts formally; 
the aim is rather to provide an informal intuition of them. We build on our 
previous work on product configuration [8, 9, 19] and, in this paper, add to that the 
evolutionary aspects. 

We model both the schema and the individuals with the same concepts. The basic 
concepts are object and relation between objects. They provide a general and rather 
natural conceptual basis, which is also exemplified by their central role in many mod- 
elling formalisms, such as predicate logic, set and relation theory and so forth. 

An object is either a generic object or a version. A generic object has a set of ver- 
sions, and each version belongs to exactly one generic object. We use the term type to 
refer to an object in schema. A type is then either a generic type or a type version. 
Objects representing individuals, which we simply call individuals, are either generic 
individuals or individual versions. Note that the term ‘object’ refers to both the ‘type’ 
and the ‘individual’ and that we model the evolution of individuals by means of ver- 
sioning. This means that all changes are implemented by creating new versions; ver- 
sions themselves are not modified! 

All relations can basically be reduced to relations between versions. For example, 
if A and B are generic objects, a relation between them is interpreted so that “each 
version of A is related to some version of B”. This is the general idea, which we will 
elaborate further when discussing specific relations. Sometimes for simplicity, for a 
relation from A to B we say that A refers to B. 

We use effectivity to organise the set of versions. Effectivity is the period the ver- 
sion was or is effective as a representative for the generic object. For modelling effec- 
tivity, we assume discrete time and relate each version with an effectivity interval. An 
effectivity interval consists of two time points, start time and end time. The end time 
is optional; an undefined end time meaning that the version is currently effective. The 
effectivity of a generic object is the union of the effectivities of its versions. For simplicity, 
we also assume that at a particular time at most one version is effective for a generic 
object and that the effectivity intervals of consecutive versions meet. For resolving 
generic references, we seek something more than simply using global time with 
effectivities. For example, we do not assume that an individual version effective at t is 
necessarily defined by a type version effective at t. Such a situation arises when product 
descriptions are modified because of product development but the product individuals 
are not converted. That is, the correct description, i.e., product type version, 
for an old product individual is not the new, effective one. 
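
The effectivity bookkeeping described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class names, the use of integers (here, years) as discrete time points and the half-open reading of an interval as [start, end) are assumptions made for the example. The lookup shown uses global time only, which, as noted above, is not sufficient on its own for resolving generic references.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    label: str
    start: int                 # discrete time point
    end: Optional[int] = None  # None: the version is currently effective

    def effective_at(self, t: int) -> bool:
        # Assumed half-open interval [start, end), so that the intervals
        # of consecutive versions meet without overlapping.
        return self.start <= t and (self.end is None or t < self.end)


@dataclass
class GenericObject:
    name: str
    versions: List[Version] = field(default_factory=list)

    def new_version(self, label: str, t: int) -> Version:
        """Close the effectivity of the latest version at t and open a new one."""
        if self.versions:
            last = self.versions[-1]
            if t <= last.start:
                raise ValueError("a new version must start after the previous one")
            # Only the effectivity end is recorded; the content is untouched.
            self.versions[-1] = Version(last.label, last.start, t)
        version = Version(label, t)
        self.versions.append(version)
        return version

    def effective_version(self, t: int) -> Optional[Version]:
        """Return the version effective at time t, if any (at most one exists)."""
        for version in self.versions:
            if version.effective_at(t):
                return version
        return None


deluxe = GenericObject("Deluxe X")
deluxe.new_version("v1", 1997)
deluxe.new_version("v2", 1999)
```

With these hypothetical dates, `deluxe.effective_version(1998)` is the first version, `deluxe.effective_version(1999)` is the second, and no version is effective before 1997.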

The association between individuals and types is modelled by two relations: is-instance-of 
and conforms-to. An individual is related to a type by the is-instance-of relation. 
The “validity” of an individual with respect to a type is modelled by means of 
conformance, which is a condition stating whether the individual conforms-to the type 
it is related to by is-instance-of. With explicit is-instance-of, we want to stress that the 
type of an individual has been explicitly decided. Due to evolution, however, an indi- 
vidual may not be a valid representative of its type; this is modelled by conformance. 
These concepts are complex and powerful as such and the situation becomes even 
more complicated when generic objects and versions are introduced. Therefore, we 
try to give an intuition of what we want to achieve with the concepts without going 
into unnecessary detail. 

The concepts are roughly illustrated in Fig. 2, in which generic objects are represented 
by ovals. Inside an oval are the versions of the generic object as small circles, 
each with an effectivity interval next to it. Solid-line arrows represent is-instance-of 
relations and dotted-line arrows relations in schema and between individuals. 

The concepts provide the basics for recording the history of objects. The concepts 
can be used in various ways, and therefore, the modelling of evolution is far from 
solved by just providing the basic concepts. Next, we enhance the model to support a 
meaningful evolution of schema and individuals together by defining informal invari- 
ants. The role of invariants is to constrain the underlying (imaginary) language and so 
approximate the intended situations [6]. A PDM system can then implement a selec- 
tion of invariants to provide the wanted semantics. 




Fig. 2. Example of basic concepts in schema and individuals and relations between them 



4.1 Is-instance-of Relation and Conformance 



Is-instance-of is a relation between individuals (generic or versions) and types (generic 
or versions). For an individual, it denotes the type of the individual (if any). The 
validity of an individual with respect to its type is controlled by means of conformance, 
as was discussed above. In particular, we are interested in the is-instance-of 
relation involving a generic individual or a generic type. 



The basic intention is that each individual has a type and conforms to that type. 
More precisely, we mean conformance of an individual version to a particular type 
version although the is-instance-of relation can be between generic objects. We define 
two invariants, a strong and a weak one, to articulate the semantics we want for configurable 
products. With the invariants, we also state to which type version the individual 
should conform in the case of generic references. 



Strong conformance invariant: An individual is constantly kept in conformance 
to the type it is-instance-of. (1) 







Invariant 1 requires that at any given time t an effective individual version is-instance-of 
an effective type version to which it also conforms. With respect to schema modifications, 
invariant 1 covers the schema evolution of databases when individuals are 
converted immediately after a schema modification. Semantics for delayed conversion 
would be achieved if the conformance were required only for the creation time of 
each individual version. That would allow a generic individual to remain untouched 
regardless of schema changes. However, as soon as the generic individual is accessed, 
a new version that conforms to the effective type version is created. The semantics for 
filtering approaches, in which individual versions live according to their original type 
version, would be slightly different. For them, the is-instance-of is allowed only from 
a generic individual to the type version effective at the time the generic individual was 
created and each individual version is required to conform to that type version. 

In all these approaches, each individual version conforms to some type version. For 
configuration modelling, however, this is too strict. For example, when a product 
individual is modified by changing components in it, the result is typically a mixture 
of original components and some new ones. Consequently, the modified product indi- 
vidual (version) does not necessarily conform to any configuration model. Therefore, 
we also define a weak conformance invariant. 

Weak conformance invariant: The first version of a generic individual conforms-to 
the type version it is-instance-of at the time of its creation. (1') 

A generic individual, i.e., its first version, is now created as an instance of a type, but 
may thereafter evolve independently of it. A conversion of a generic individual to a 
new type version is represented by creating a new, converted version and relating it 
by is-instance-of to the correct type version. (Note that invariant 1 implies invariant 1'.) 

When a generic individual is-instance-of a generic type, it means that all versions 
of the generic individual are instances of some version of the type. If an individual 
version is-instance-of a type, this does not say anything about the consecutive ver- 
sions of the generic individual. So, a generic individual can be used in is-instance-of 
relation to control the type of its future versions. This requires the creation of a new 
individual if the relation needs to be changed, for example, to a new type version. 
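
The difference between invariants 1 and 1' can be made concrete with a small check over the version history of one generic individual. The sketch below is a simplification under stated assumptions: a type version is reduced to a set of allowed parts, conformance is taken to mean that every installed part is allowed, and effectivity is ignored, so the strong check simply requires every individual version to conform to the type version it is-instance-of. All names and part labels are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class TypeVersion:
    name: str
    allowed_parts: FrozenSet[str]  # assumed, much simplified configuration model


@dataclass(frozen=True)
class IndividualVersion:
    parts: FrozenSet[str]
    instance_of: TypeVersion


def conforms(ind: IndividualVersion) -> bool:
    # Assumed conformance condition: every installed part is allowed by the type.
    return ind.parts <= ind.instance_of.allowed_parts


def satisfies_strong(history: List[IndividualVersion]) -> bool:
    """Simplified invariant 1: every version conforms to its type version."""
    return all(conforms(version) for version in history)


def satisfies_weak(history: List[IndividualVersion]) -> bool:
    """Invariant 1': only the first version has to conform."""
    return not history or conforms(history[0])


deluxe_v1 = TypeVersion("Deluxe X v1", frozenset({"frame", "standard handlebar"}))
as_manufactured = IndividualVersion(frozenset({"frame", "standard handlebar"}), deluxe_v1)
after_service = IndividualVersion(frozenset({"frame", "Sporty W"}), deluxe_v1)
history = [as_manufactured, after_service]
```

Here the serviced bike satisfies the weak invariant but violates the strong one, because the new handlebar is not part of the configuration model according to which the bike was created.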



4.2 Is-a Relation and Inheritance 

Is-a is a relation in schema that serves mainly two purposes. First, it organises the 
types into a classification taxonomy, in which the “lower” objects are specialisations, 
also called subtypes, of the “higher” ones, also called supertypes of the former. This 
ordering bears the idea that the subtypes can be used in place of their supertypes. 
Second, is-a relation provides a mechanism for sharing common properties by means 
of inheritance from supertypes. We define invariants for controlling the effects that 
changes have via inheritance. 

Strong effectivity invariant for is-a: The effectivity of each (sub)type version 
must be contained in the effectivity of a single version of its supertype. (2) 

Consequently, a generic type may version without a need to update its supertypes, but 
for its subtypes, new versions need to be created. The invariant 2, therefore, guaran- 
tees that the inherited properties of a type version remain unchanged. This invariant 
provides a strict modelling basis in which the properties of a type version never 
change during its effectivity. 

Existence of supertype invariant: The effectivity of a type must be contained in the 
effectivity of a (super)type it is-a (subtype of). (2') 

The invariant 2' is a weaker form of the invariant 2. The main difference between 
them is that the invariant 2' requires only that an effective version for the supertype 
exists, not that the changes in it are propagated downwards. Consequently, the inher- 
ited properties of a type version may change during its effectivity. 
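
Both invariants reduce to interval containment, which the following Python sketch makes explicit. It is an illustration under stated assumptions: effectivity intervals are represented as `(start, end)` pairs with `None` for an open end, the effectivity of the generic supertype is computed as one span on the assumption that its version intervals meet, invariant 2' is checked for a single subtype version, and the years used are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, Optional[int]]  # (start, end); end None means open-ended


def contains(outer: Interval, inner: Interval) -> bool:
    """True if the inner interval lies completely within the outer one."""
    o_start, o_end = outer
    i_start, i_end = inner
    if i_start < o_start:
        return False
    if o_end is None:
        return True
    return i_end is not None and i_end <= o_end


def union_span(versions: List[Interval]) -> Interval:
    """Effectivity of a generic type, assuming its version intervals meet."""
    start = min(s for s, _ in versions)
    ends = [e for _, e in versions]
    end = None if any(e is None for e in ends) else max(ends)
    return (start, end)


def strong_is_a(sub_version: Interval, super_versions: List[Interval]) -> bool:
    """Invariant 2: contained in the effectivity of a single supertype version."""
    return any(contains(sv, sub_version) for sv in super_versions)


def existence_is_a(sub_version: Interval, super_versions: List[Interval]) -> bool:
    """Invariant 2': contained in the effectivity of the generic supertype."""
    return contains(union_span(super_versions), sub_version)


x_bike_versions = [(1990, 1995), (1995, None)]  # two versions of the supertype
standard_x_old = (1993, 1998)                   # spans the supertype change in 1995
standard_x_new = (1995, None)                   # created together with the new supertype version
```

The subtype version `standard_x_old` satisfies only the existence invariant, since the supertype changed during its effectivity, whereas `standard_x_new` satisfies both.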



4.3 Has-part Relation 

The has-part relation differs from is-a in that it occurs both in schema and between individuals 
(but not between an object in schema and an individual). The detailed semantics of 
has-part [8] is not important here since we are interested in the evolutionary aspects. 
We define invariants similar to those for is-a. 

Strong effectivity invariant for has-part: The effectivity of a version must be 
contained in the effectivity of a version it has as part. (3) 

Existence of part invariant: The effectivity of an object must be contained in the 
effectivity of an object it has as part. (3') 

The invariants are written so that they can be applied to has-part in schema as 
well as between individuals. For has-part between types, the former corresponds to 
the versioning semantics in which a modification to a component is propagated to all 
wholes using it as part. The latter corresponds to the semantics in which components 
may version independently, typically as long as the changes are internal to the com- 
ponent. 

Discussion on the use of generic individuals in has-part is similar to that of is- 
instance-of. A has-part relation from a generic individual states that all its future ver- 
sions have the part, which may be too strong. For types, however, one may want to 
make such a statement thus requiring a creation of a new generic type, not only a new 
type version, in case the part needs to be changed. A has-part to a generic individual 
allows modification, e.g., servicing, of an individual component without the need to 
change the whole (in case the existence invariant 3' is used). 



4.4 Combination of Is-instance-of, Is-a and Has-part 

We have now defined three kinds of invariants: strong, existence and weak. The weak 
invariant was defined only for conformance since we wanted to support the evolution 
of individuals rather independently of the types. We could also have defined a weak 
invariant for is-a. That would be an invariant that requires only the creation 
time of a type to be contained in the effectivity of its supertype. Such an invariant would 
allow evolution of a type independently of its supertypes. We wanted such independence 
between individuals and types, but not for the type hierarchy, for which we want 
to control the evolution more strictly. 

For a data management system, the important question is what forms of the invari- 
ants should be selected. For a traditional database, the strong ones are typically the 
most appropriate as has already been discussed. Our intention in this paper is to find a 
suitable selection for a PDM system of configurable products. A reasonably good 
choice of semantics for such a system could use: 

- weak conformance invariant for is-instance-of (i.e., 1'), 

- strong effectivity invariant for is-a (i.e., 2) and 

- existence of part invariant for has-part (i.e., 3'). 

This selection of semantics requires strict control on the changes to the classification 
hierarchy. That is, a change in a type necessitates its propagation to the subtypes and 
the properties of a type version cannot change during its effectivity. 

Has-part, however, reflects the evolution of components in a company. It is typical 
to allow certain modifications to a component, i.e., creation of new versions, without 
propagating them in the product structures. The allowed modifications are sometimes 
defined as those that maintain the “form, fit and function” of the component. Such 
semantics is achieved with the existence invariant. For individuals, this also allows modification 
of a component individual without versioning the whole product individual. 

For the relation of individuals to their types, we have already argued that changes 
to product definitions are not automatically propagated to the individuals and the 
individuals may be modified independently of their types. However, product indi- 
viduals should be created according to a product type. This is captured by the weak 
conformance invariant, which requires the creation time conformance of the generic 
individual but allows independent evolution thereafter. This means that only the first 
version of a generic individual must conform to its type. If the generic individual is 
later converted to another type (version), this can be recorded by creating a new, converted 
individual version that is-instance-of the type. This, however, requires that the 
is-instance-of relation has been defined for the first version of the generic individual, not 
for the generic individual itself. 



5 Conclusions 

The problems addressed in this paper are those of configurable products. The concepts 
presented, however, do not directly reflect the concepts of configurable products. The 
explanation is that the need for dealing with both the evolution of product types and 
the product individuals is characteristic of configurable products. In mass-products, 
the evolution of a single product individual is not recorded. In very complex project 
products, there is no model from which the individuals could be instantiated. Configurable 
products are somewhere in the middle, combining the routine aspects of mass-products, 
e.g., the pre-defined product types, and the uniqueness of product individuals of 
complex products, for which the evolution of individuals is more interesting. Configurable 
products thus provide a problem of their own for the evolution in data modelling 
systems. We provided a conceptual framework for managing such evolution. 

There are other data modelling approaches that might support the needed evolution. 
One special approach is to abandon the distinction between classes and instances 
altogether [19]. This provides some degree of freedom for describing configurable 
products but cannot escape the problems of evolution [20]. Even with no classes and 
instances, there will be class-like objects and objects that represent individuals. 

Schema evolution of databases, on the other hand, is based on the principle that 
schema constantly represents the same (or at least almost the same) real world enti- 
ties. Consequently, the conversion of old instances is meaningful. For configurable 
products, this does not hold and therefore, we had to discard the strict conformance 
between the schema and individuals. 

As most versioning approaches concentrate on modelling design objects, the meth- 
ods have not been applied to configuration models. Typically versioning of design 
objects is represented at instance level and the evolution of schema, if present, is simi- 
lar to the schema evolution of databases [15, 21, 22]. 

In product modelling, the largest initiative is the STEP standardisation effort [23]. 
STEP addresses the modelling of products for their whole lifetime. Modelling the 
evolution of product individuals is well catered for, but there are problems in representing 
product families, not to mention their evolution [4]. 

It is hard to provide semantics that would suit all PDM systems of configurable 
products. Nevertheless, we provided a selection of invariants that covers the most 
typical cases. The conceptual framework is derived from the experiences we have 
with two dozen companies that are manufacturing configurable products. Therefore, 
we dare to claim that although the feasibility of the approach presented has not been 
validated, it is not hypothetical. 

Abstract. A substantial number of temporal extensions to data models 
and query languages have been proposed. However, little attention has 
been paid to the migration of data and applications from a “snapshot” 
DBMS to a temporal extension of it. In this paper, we analyze this issue 
and precisely formulate some requirements related to it. We then present 
a temporal extension of the ODMG’s object database standard fulfilling 
these requirements. Throughout this presentation, we underscore the im- 
portance of providing adequate update and interpolation modalities in 
achieving application migration support. 

Keywords: temporal databases, application migration, schema evolu- 
tion, temporal updates, object-oriented databases, ODMG 



1 Introduction 

Temporal databases aim at integrating time as a primitive concept in DBMS. 
Research in this area has given birth to a substantial number of temporal data 
models, most of which are actually extensions of existing “snapshot” ones. A 
common rationale for temporally enhancing snapshot data models, rather than 
designing temporal ones from scratch, is that the resulting models can be easily 
integrated into existing systems. The applications built on top of these systems 
may then rapidly benefit from the added technology. 

However, the smooth migration of applications running on top of snapshot 
database systems to temporal extensions of them, imposes some compatibility re- 
quirements. Surprisingly, this issue has been neglected by the temporal database 
community until relatively recently, when notions such as seamless integration 
of time [7] and temporal upward compatibility [1, 12] have been studied. 

The first part of this paper precisely defines some requirements related to 
data and application migration towards temporal DBMS extensions, with an 
emphasis on object-oriented ones. The main originality of the adopted approach 
is to view this problem as a particular case of schema evolution. 

We then present a temporal extension of ODMG’s object database model [3] 
fulfilling the above migration requirements. This extension directly stems from 
the Tempos object-oriented temporal database framework [6,4]. It defines a set 
of abstract datatypes for handling temporal values and histories, and provides 
temporal extensions of the main ODMG components: the object model, the 
schema definition language and the query language. In this paper we focus on
the update and interpolation modalities defined by the extended object model, 
and put forward their major role in fulfilling the migration requirements. 

The paper is organized as follows. Section 2 introduces the concepts related 
to application migration towards temporal DBMS. Section 3 overviews the pro- 
posed extension of ODMG’s model. A description of the update operators pro- 
vided by this extension is given in section 4. Section 5 shows how application 
migration is achieved. Finally, section 6 concludes and discusses future work.



2 Application migration requirements and related work

Following [7, 1], we are interested in specifying migration requirements between
pairs of data models defined as follows.

Definition 1 (data model). A data model $M$ is a quadruplet
$(D, Q, U, \llbracket\cdot\rrbracket_M)$ composed of a set of database instances $D$,
a set of legal query statements $Q$, a set of legal update statements $U$, and an
evaluation function $\llbracket\cdot\rrbracket_M$. Given an update statement $u \in U$,
a query statement $q \in Q$ and a database instance $db \in D$,
$\llbracket u(db) \rrbracket_M$ yields a database instance, and
$\llbracket q(db) \rrbracket_M$ yields an instance of some data structure. □

Hence, a database instance is seen as an abstract entity, to which it is possible
to apply updates (statements whose evaluation maps a database instance into
another one), and queries (statements whose evaluation over a database yields an
instance of some data structure).

We successively introduce two levels of migration requirements: upward com- 
patibility and temporal transition support. The definitions that we provide may 
be seen as adaptations to the object-oriented framework, of the notions of up- 
ward compatibility and temporal upward compatibility introduced in [1, 12]. 
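
Definition 2 (upward compatibility). A data model
$M' = (D', Q', U', \llbracket\cdot\rrbracket_{M'})$ is upward compatible with another
data model $M = (D, Q, U, \llbracket\cdot\rrbracket_M)$, iff:

- $D \subseteq D'$, $Q \subseteq Q'$ and $U \subseteq U'$ (syntactical upward
  compatibility in [1])

- For any $db$ in $D$, for any $q$ in $Q$, for any $u_1, u_2, \ldots, u_n$ in $U$
  and for any instants $d_1, \ldots, d_n, d_{n+1}$:

  $$\llbracket q^{d_{n+1}}(u_n^{d_n}(\ldots u_2^{d_2}(u_1^{d_1}(db)))) \rrbracket_M
    = \llbracket q^{d_{n+1}}(u_n^{d_n}(\ldots u_2^{d_2}(u_1^{d_1}(db)))) \rrbracket_{M'}$$

□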



Notice that in this latter expression, updates and queries are parameterized 
by an instant. This is because, in some temporal data models, the semantics of 
queries and updates depend on the instant at which they are issued. 
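
The condition of Definition 2 can be exercised on a finite sample of databases, updates and queries. The following is a minimal sketch, not part of the original proposal: the toy evaluation functions `eval_snapshot`, `eval_extended` and `eval_broken`, the tuple encoding of statements, and the use of integers as instants are all assumptions made for illustration. A bounded check of this kind can refute upward compatibility but cannot prove it.

```python
from itertools import product


def eval_snapshot(statement, db, instant):
    """Toy model M: a database is a dict; the instant is ignored."""
    kind, key, *rest = statement
    if kind == "set":                      # update: returns a new database instance
        return {**db, key: rest[0]}
    if kind == "get":                      # query: returns a plain value
        return db.get(key)
    raise ValueError(f"unknown statement {statement!r}")


def eval_extended(statement, db, instant):
    """Toy model M': adds a 'log' update, leaves the old statements untouched."""
    if statement[0] == "log":
        return {**db, "_log": db.get("_log", ()) + ((instant, statement[1]),)}
    return eval_snapshot(statement, db, instant)


def eval_broken(statement, db, instant):
    """A non-compatible variant: 'get' now wraps its result in a list."""
    result = eval_snapshot(statement, db, instant)
    return [result] if statement[0] == "get" else result


def agree(eval_m, eval_m2, db, updates, query, instants):
    """Apply u_1..u_n at d_1..d_n under both models, then compare q at d_{n+1}."""
    db_m, db_m2 = db, db
    for u, d in zip(updates, instants):
        db_m, db_m2 = eval_m(u, db_m, d), eval_m2(u, db_m2, d)
    d_query = instants[len(updates)]
    return eval_m(query, db_m, d_query) == eval_m2(query, db_m2, d_query)


def compatible_on_sample(eval_m, eval_m2, dbs, updates, queries, max_len=2):
    """Check the semantic condition for all update sequences up to max_len."""
    for db, q in product(dbs, queries):
        for n in range(max_len + 1):
            for seq in product(updates, repeat=n):
                if not agree(eval_m, eval_m2, db, seq, q, list(range(n + 1))):
                    return False
    return True


DBS = [{}, {"x": 1}]
UPDATES = [("set", "x", 2), ("set", "y", 3)]
QUERIES = [("get", "x"), ("get", "y")]
```

On this sample, `eval_extended` agrees with `eval_snapshot` on every legacy statement, whereas `eval_broken` changes the result type of a legacy query and is rejected.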

In the setting of ODMG, the set of queries Q not only includes those which 
are submitted to the OQL interpreter, but also, all accesses to class extents and 
object properties via application programs. Similarly, updates include object 
creations and deletions, as well as updates to object properties. We will examine 
all these update operators in section 4, and define temporal extensions of them. 

To illustrate upward compatibility, consider an ODMG compliant DBMS
managing a database about document loans in a library. Upward compatibility 
states that if the ODMG DBMS is replaced by a temporal extension of it, the 
application programs accessing these data may be left intact. This implies that 
the set of database instances recognized by the extension is a super-set of those 
recognized by the original DBMS, and that the query and update statements 
have identical semantics in the original DBMS and in the temporal extension. 

Now, suppose that once the legacy applications run on the temporal exten- 
sion, it is decided that the history of the loans should be kept, but in such a way 
that legacy application programs may continue to run (at worst they should 
be recompiled), while new applications should perceive the property as being
“historical”. In the sequel we call this requirement temporal transition support.

While upward compatibility is achieved by adding new concepts and con- 
structs to a model without modifying the existing ones, temporal transition 
support is more difficult to achieve. Indeed, [1] shows that almost none of the 
existing temporal extensions of SQL, including TSQL2, satisfy this requirement. 

The same remark holds for existing object-oriented temporal extensions, and 
in particular for those concerning ODMG (e.g. TAU [8] and T.ODMG [2]). For 
example, consider a class Document with a property loaned_by. In TAU, if some 
temporal support is attached to this property, then any subsequent access to it 
will retrieve not only the current value of the document’s loaned_by property (as
in the snapshot version of the database), but also its whole history. 

TOOBIS [14] (another ODMG temporal extension) does not exhibit this 
latter problem. However, in achieving temporal transition support, TOOBIS 
introduces some burden to temporal applications. Indeed, in TOOBIS TOQL 
for instance, each reference to a temporal property in a query should be prefixed 
by either keyword valid, transaction or bitemporal, leading to cumbersome query 
expressions. This approach is actually equivalent to duplicating the symbols 
for accessing data when adding temporal support, in such a way that for each 
temporally enhanced property x, there are actually two properties representing 
it in the database schema, say x and temporal_x. 

We propose a different approach: when temporal support is added to some 
component of a database schema S, yielding a new schema S', application pro- 
grams are divided into two groups: those which view data under schema S, and 
those which view it under schema S'. Therefore, the problem of temporal tran- 
sition support is seen as a particular case of schema evolution, and techniques 
developed in this context apply. More precisely, our approach to temporal tran- 
sition support is based upon the notion of bi-accessible temporal data model 
defined as follows. 

Definition 3 (Bi-accessible temporal data model). A bi-accessible temporal data
model is a pair of data models $(M_S, M_T)$,
$M_S = (D_S, Q_S, U_S, \llbracket\cdot\rrbracket_{M_S})$,
$M_T = (D_T, Q_T, U_T, \llbracket\cdot\rrbracket_{M_T})$, such that:

1. $M_S$ is a snapshot data model upward compatible with respect to temporal
   data model $M_T$

2. for any $db_s \in D_S$, there exists a $db_t \in D_T$, such that, for any
   $q \in Q_S$, for any $u_1, u_2, \ldots, u_n \in U_S$ and for any instants
   $d_1, \ldots, d_n, d_{n+1}$:

   $$\llbracket q^{d_{n+1}}(u_n^{d_n}(\ldots u_2^{d_2}(u_1^{d_1}(db_s)))) \rrbracket_{M_S}
     = \llbracket q^{d_{n+1}}(u_n^{d_n}(\ldots u_2^{d_2}(u_1^{d_1}(db_t)))) \rrbracket_{M_T}$$

□

The process of mapping $db_s$ into $db_t$ is usually termed database adaptation
or database conversion in the schema evolution terminology. We elaborate on 
this issue in section 5 where we also introduce a view mechanism simulating 
bi-accessibility on top of the proposed ODMG extension. 



3 A temporal extension of ODMG’s object model 

In this section, we overview the proposed extension of ODMG’s model. First we 
briefly present the temporal datatypes. Next, we show how the basic abstractions 
of ODMG’s model are temporally extended. For more details, the reader may 
refer to the bibliography on Tempos [6,4], upon which this extension is based. 

3.1 Temporal values and histories 

Tempos is based upon a discrete, linear and bounded time model in which the 
time line is structured into time units. A time unit defines the precision at which 
time is observed in a particular context. Common units include year, month and 
day. The set of time units is extensible. 

Based on this temporal structure, the Tempos model defines three basic 
temporal datatypes: Instant, Duration, and Set of instants. Intervals are viewed 
as particular cases of sets of instants. 

The History Abstract DataType (ADT) models functions from a finite set 
of instants observed at a fixed granularity, to a set of values of a given type. 
The domain and the range of a history are respectively called its temporal and 
structural domain. Two selectors of the History ADT (respectively TDomain and
SDomain) retrieve these two components of a history.
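
As an illustration, a history can be sketched as a finite mapping from instants to values. The class below is a hypothetical rendering, not the Tempos ADT itself: the method names `tdomain`, `sdomain`, `value_at` and `restrict` are chosen here, instants are assumed to be integers counted in the history's unit, and the structural domain is read as the set of values actually taken.

```python
class History:
    """A function from a finite set of instants to values, at a fixed unit."""

    def __init__(self, pairs, unit="day"):
        self._map = dict(pairs)   # instant -> structural value
        self.unit = unit          # granularity at which instants are observed

    def tdomain(self):
        """Temporal domain: the instants on which the history is defined."""
        return frozenset(self._map)

    def sdomain(self):
        """Structural domain, read here as the set of values taken."""
        return set(self._map.values())

    def value_at(self, instant):
        return self._map[instant]

    def restrict(self, instants):
        """Restriction of the history to a given set of instants."""
        kept = {i: v for i, v in self._map.items() if i in instants}
        return History(kept, self.unit)


prices = History({1: 50, 3: 60})
```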

A set of algebraic operators is defined on histories. These operators include 
restrictions, joins, extended set operators, grouping operators, aggregations and 
operators for reasoning about succession in time. In addition, a language for 
describing patterns of histories has been defined in [4].



3.2 Temporal support at the property level 

In ODMG, a property is defined as an attribute or traversal path of a relationship 
attached to some class. For instance, possible properties of an Employee class 
include salary and department. Like classes, properties may be instantiated and
instances of properties are attached to objects. More precisely, an object is made 
up of a unique identifier and a collection of property instances. Each of these 
property instances has a value attached to it which may be accessed and updated 
through predefined methods defined over the Property interface. 

In the proposed extension, a property is temporal if the successive values of its
instances are recorded, or else fleeting if only the most recent value is recorded. 
When a property is temporal, each of its instances has a history attached to it, 
whose granularity is fixed by the observation unit of the property. 

As in ODMG, a type is attached to a property. In the case of a fleeting 
property, the type models the domain of possible values of an instance of this 
property. In the case of a temporal property, it models the domain of the possible 
structural values of the histories attached to the instances of this property. 

The temporal dimension of a temporal property determines the semantics of 
the temporal associations that it models. It may be valid-time or transaction- 
time depending on whether the facts are timestamped with respect to the mod- 
eled reality or with respect to the database evolution [11]. The corresponding 
properties are referred to as valid-time or transaction-time properties.

The observation temporal domain (or observation domain in short) of a tem- 
poral property instance, is the set of instants during which the property is ob- 
served for a given object. 

As stated above, each temporal property instance has a history. The temporal 
domain of this history is equal to the observation domain of the property instance 
itself. Its structural values are either defined by some update, or derived from 
the values provided by the user by means of a semantic assumption. 

More precisely, each temporal property instance has an effective history, cor- 
responding to the input timestamped values attached to it. The effective history 
is contained in (but not necessarily equal to) the property instance’s history. 
The difference between these two histories is called the potential history (i.e. the 
part of the history calculated using the semantic assumption). In the sequel, the 
temporal domain of a property instance’s effective history is called its effective 
temporal domain, or effective domain in short. 

We distinguish three particular semantic assumptions depending on the in- 
tended calculation mode of the potential history (see figure 1): 

- Discrete: the structural values of the potential history are all equal to the 
neutral value of the structural type (e.g. 0 for integers, nil for objects) or to 
some padding value provided by the user. This is the case of the production 
of some product in a factory: the period of time during which the production 
is defined (its observation domain) may be known in advance (e.g. all week- 
days). However, at some days, it may happen that there is no production 
(e.g. due to a strike), so that the effective history is undefined for those days. 

- Stepwise: structural values are “stable” between two instants in the effective 
domain (e.g. a property instance modeling an employee’s salary). As for 
discrete properties, a padding value is attached to a stepwise property to set 
the structural value of its instances at those instants for which the stepwise 
assumption does not provide one (e.g. if the smallest instant in the effective 
domain is different from the smallest instant in the observation domain). 

- Linearly interpolated: this kind of interpolation applies only to numerically- 
valued properties. Between two “successive” instants in the effective history, 
the structural value varies linearly. 

Other assumptions may be defined depending on the needs of applications,
and the characteristics of the involved structural type. 
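
The three assumptions can be sketched as a single derivation function that fills the potential history from the effective one. This is an illustrative sketch under stated assumptions: instants are integers, the function name `derive_history` is hypothetical, and the choice to fall back on the padding value when linear interpolation lacks a neighbour on one side is ours, since the text does not specify that case.

```python
from bisect import bisect_right


def derive_history(effective, observation, assumption, padding=None):
    """Build the full history over the observation domain.

    effective:   dict instant -> value (the effective history)
    observation: iterable of instants (the observation domain)
    assumption:  "discrete", "stepwise" or "linear"
    """
    known = sorted(effective)
    history = {}
    for i in sorted(observation):
        if i in effective:                      # effective values are kept as-is
            history[i] = effective[i]
            continue
        pos = bisect_right(known, i)
        before = known[pos - 1] if pos > 0 else None
        after = known[pos] if pos < len(known) else None
        if assumption == "discrete":
            history[i] = padding
        elif assumption == "stepwise":
            # The last effective value stays stable until the next one.
            history[i] = effective[before] if before is not None else padding
        elif assumption == "linear":
            if before is None or after is None:
                history[i] = padding            # assumption: no extrapolation
            else:
                frac = (i - before) / (after - before)
                history[i] = effective[before] + frac * (
                    effective[after] - effective[before]
                )
        else:
            raise ValueError(f"unknown assumption {assumption!r}")
    return history


effective = {2: 10, 6: 30}
observation = range(0, 8)
```

With the sample data, the stepwise assumption holds the value 10 from instant 2 up to instant 5, while the linear assumption yields 20.0 at instant 4.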



[Figure 1: temporal property characteristics and the resulting property instance history]

Fig. 1. Effective history and semantic assumptions



3.3 Temporal support at the class extent level 

Like properties, classes may be fleeting or temporal. Declaring a class as temporal
attaches a “lifespan” (called observation domain) to each object of the class.
This makes it possible to determine the set of objects in the class extent which are
observed at a given instant.

Conceptually, the observation domain of a temporal object, is the set of 
instants during which the information conveyed by this object is relevant to 
the applications and thus observed. The notion of “observation” may either be 
defined with respect to transaction-time, or to valid-time. 

For instance, consider a class Product modeling the product types provided 
by a company. If the class is declared as temporal (either with respect to valid- 
time or transaction-time), then the observation domain could model the time 
when a particular product is produced. In addition, if the class is transaction- 
time, the observation domain of an object of this class captures the time when 
the database knows that a product is produced, whereas if the class is valid-time, 
it models the time when the corresponding product is produced in reality. 

In accordance with ODMG, the extent of a class (whether fleeting or tempo- 
ral) is the set of all instances of this class having been created and not deleted. 
In the case of temporal classes, we distinguish the extent of the class as defined 
above, from the observed extent at an instant i, which is the subset of the extent 
consisting of all objects whose observation domain contains i. 
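
The distinction between the extent and the observed extent can be sketched as a simple filter. The names `TemporalObject`, `observed_extent` and `the_products` are hypothetical, and observation domains are modelled as sets of integer instants for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TemporalObject:
    name: str
    observation_domain: set = field(default_factory=set)


def observed_extent(extent, instant):
    """Subset of the extent whose observation domain contains the instant."""
    return [obj for obj in extent if instant in obj.observation_domain]


# The (whole) extent: every object created and not deleted.
the_products = [
    TemporalObject("lamp", {1, 2, 3}),
    TemporalObject("desk", {3, 4}),
]
```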

A given application may either manipulate the whole extent of a class, or the 
extent at some instant. For instance, a snapshot application accessing a temporal 
class, is likely to be interested only in the observed extent of the class at the 
current instant, as discussed in section 5.2. 

4 Database updates and accesses

4.1 Temporal property instances’ evolution 

In ODMG, there is one access and one update operator for property instances,
respectively get_value, which retrieves the value of the property instance, and
set_value, which assigns to the property instance the value given as parameter.

The set of updating and accessing operators on temporal property instances 
is more complex. These operators are classified depending on whether they are 
intended to modify the observation domain or the effective history, and depend- 
ing on the temporal dimension to which they apply. 

Evolution of transaction-time property instances. Since transaction-time
is intended to model the evolution of the database, the observation domain of
transaction-time property instances evolves automatically with respect to the
database time (i.e. the system clock). Conceptually, the current instant is added
to the observation domain of a transaction-time property instance at each system
clock tick. This automatic evolution of the observation domain can be overridden
at any time, e.g. to model the fact that the property instance is not observed
during some period of time. This is achieved through the notion of growth status,
which takes one of two values: On or Off. If the value of the growth status of
a transaction-time property instance is On, its observation domain evolves with
the system clock. Otherwise, it does not evolve at all. Operators turn_on and
turn_off switch between these two states.

An overloaded version of ODMG’s set_value operator updates the
effective history of a transaction-time property instance. Operation set_value(v),
applied to a transaction-time property instance, replaces its effective history by
a new history, identical to the old one except that it maps the current instant
to value v. If needed, the growth status of the property instance is turned on.
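
The behaviour of the growth status can be sketched as follows. This is an illustrative model, not the implementation described in the paper: the `tick` method is a hypothetical hook standing for a system clock tick, the current instant is passed explicitly as an integer, the initial growth status of a property instance is assumed to be On, and adding the current instant to the observation domain inside `set_value` is our way of keeping the effective domain inside the observation domain.

```python
class TransactionTimeProperty:
    """Sketch of a transaction-time property instance with a growth status."""

    def __init__(self, growth_on=True):       # initial status is an assumption
        self.observation_domain = set()
        self.effective_history = {}           # instant -> value
        self.growth_on = growth_on

    def turn_on(self):
        self.growth_on = True

    def turn_off(self):
        self.growth_on = False

    def tick(self, now):
        """Hypothetical clock hook: grow the observation domain while On."""
        if self.growth_on:
            self.observation_domain.add(now)

    def set_value(self, value, now):
        """Map the current instant to value; turn growth on if needed."""
        self.turn_on()
        self.observation_domain.add(now)
        self.effective_history = {**self.effective_history, now: value}


loan = TransactionTimeProperty()
loan.set_value("alice", 1)
loan.tick(2)
loan.turn_off()
loan.tick(3)          # not observed: growth status is Off
loan.turn_on()
loan.tick(4)
```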

Evolution of valid-time property instances Unlike transaction-time prop- 
erties, the observation domain of a valid-time property instance does not evolve 
automatically with the system clock. Instead, an operator set_odomain is provided,
which destructively replaces the observation domain of the property instance
by the set of instants given as parameter. Since the observation domain
should contain the effective domain, this operator may force some modifications 
on the effective history. For instance, if the constraint is violated after some 
update to the observation domain, the effective history is restricted to fit inside 
the new observation domain. 

The primitive operator for updating the effective history of a temporal property
instance is set_ehistory, which replaces the effective history of the temporal
property by the one given as parameter. The standard set_value operator is also
supported. Given a valid-time property instance VTPI, VTPI.set_value(v) replaces
VTPI’s effective history by a new history, identical to the old one except that it
maps the current instant to value v.

To enforce the inclusion constraint relating the effective domain of a temporal
property instance and its observation domain, set_ehistory modifies, when
necessary, the observation domain of the property instance to which it applies.
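
The inclusion constraint between the effective domain and the observation domain can be sketched as below. This is one possible reading only: the text does not say how set_ehistory modifies the observation domain, so the sketch assumes that it extends the domain with the missing instants. Instants are integers and the current instant is passed explicitly; both are simplifications made for illustration.

```python
class ValidTimeProperty:
    """Sketch of a valid-time property instance and its inclusion constraint."""

    def __init__(self):
        self.observation_domain = set()
        self.effective_history = {}            # instant -> value

    def set_odomain(self, instants):
        """Replace the observation domain; trim the effective history to fit."""
        self.observation_domain = set(instants)
        self.effective_history = {
            i: v for i, v in self.effective_history.items()
            if i in self.observation_domain
        }

    def set_ehistory(self, history):
        """Replace the effective history; extend the observation domain if
        needed (assumed behaviour, see the lead-in)."""
        self.effective_history = dict(history)
        self.observation_domain |= set(self.effective_history)

    def set_value(self, value, now):
        """Same history as before, except that `now` maps to `value`."""
        self.set_ehistory({**self.effective_history, now: value})


price = ValidTimeProperty()
price.set_odomain({1, 2, 3})
price.set_ehistory({1: 50, 5: 60})     # instant 5 lies outside {1, 2, 3}
```

In both directions the invariant "effective domain is included in the observation domain" holds after each operation.
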

Accessing temporal property instances. The primitive access operators for
both valid-time and transaction-time property instances are get_ehistory and
get_history. The former retrieves the effective history of the property instance.
The latter builds a history from the observation domain and the effective history
using the corresponding semantic assumption¹, as depicted in figure 1.

Operator get_value is also defined on temporal property instances: it retrieves
the structural value of the property instance’s history at the current instant.

4.2 Temporal objects’ observation domain evolution 

The observation domain of transaction-time objects evolves automatically with 
the system clock in a similar way as that of transaction-time property instances. 

More precisely, each transaction-time object has a growth status, which may 
be On or Off. Conceptually, while a transaction-time object is On, the current 
instant is added to its observation domain at each system clock tick. Operators
turn_on and turn_off modify the value of the growth status of a
transaction-time object. When an object is created, its growth status is On.

Regarding valid-time objects, the unique operator provided for updating the
observation domain is set_odomain, which sets the observation domain to be the
set of instants given as parameter.

4.3 Example 

Consider a class Product with a valid-time stepwise attribute price, having struc- 
tural type real. The valid-time observation domain of an object of class Product 
models the time when the product is produced (at the granularity of the day), 
while the observation domain of an instance of property price models the time 
when its price is defined. Table 1 illustrates a possible update scenario. The no- 
tation [i..] (resp. [..i]) designates the interval containing all instants having the 
same granularity as i and greater than or equal to it (resp. less than or equal). 

5 Achieving temporal transition support 

Whenever temporal support is incorporated into previously fleeting classes 
and/or properties, temporal transition support demands that applications may 
continue to access the database as if no temporal support had been added, or 
else, take into account this schema modification. This situation is a rather simple 
case of schema evolution. Our approach to handle it is similar to those described 
in [9, 10]. More precisely, three steps are followed. Firstly, the schema is modified. 
Next, the database is converted to fit the new schema (see section 5.1). Lastly, an 
updatable “snapshot” view of the database is introduced. Contrary to [10] and
other related works where the issue of schema evolution is studied in a broader 
setting, we do not employ a complex view mechanism. Instead, we propose an 
ad hoc view mechanism based on the update operators on properties defined in 
the previous section, and on the notion of access mode (see section 5.2). 

¹ Transaction-time properties have a stepwise semantic assumption.

Scenario:   A product is introduced on 1/4/98 with price 50
Operation:  P = new Product; P.set_odomain([1/4/98..]);
            P.price.set_ehistory({(1/4/98, 50)})
Result:     P.price.get_history() = {([1/4/98..], 50)}

Scenario:   On 1/6/98, the product’s price rises to 60
Operation:  P.price.set_ehistory(P.price.get_ehistory() ∪+ {(1/6/98, 60)})
Result:     P.price.get_history() = {([1/4/98..31/5/98], 50), ([1/6/98..], 60)}

Scenario:   The product is no longer produced starting from 6/8/98
Operation:  P.set_odomain(P.get_odomain() ∩ [..5/8/98])
Result:     P.get_odomain() = [1/4/98..5/8/98]
            P.price.get_history() = {([1/4/98..31/5/98], 50), ([1/6/98..5/8/98], 60)}

Scenario:   The product is produced again starting from 1/1/99; its price is not
            observed then
Operation:  P.set_odomain(P.get_odomain() ∪ [1/1/99..])
Result:     P.get_odomain() = {[1/4/98..5/8/98], [1/1/99..]}
            P.price.get_history() unchanged

Scenario:   The product’s price is observed starting from 1/2/99 and until 31/3/99.
            Its value at 1/2/99 is 80
Operation:  P.price.set_odomain(P.price.get_odomain() ∪ [1/2/99..31/3/99])
            P.price.set_ehistory(P.price.get_ehistory() ∪+ {(1/2/99, 80)})
Result:     P.get_odomain() unchanged
            P.price.get_history() = {([1/4/98..31/5/98], 50), ([1/6/98..5/8/98], 60),
            ([1/2/99..31/3/99], 80)}

Table 1. Update scenario

5.1 Database instance adaptation 

The algorithm below describes how a database instance is adapted to a schema 
modification which adds temporal support, i.e. which transforms fleeting prop- 
erties or fleeting classes into temporal ones. This conversion operator should be 
applied to all existing objects in the database whenever the schema is modified².
Throughout the algorithm, some functions such as classOf and properties are 
used to retrieve metadata about the modified schema. 

This algorithm puts forward the fact that temporal transition support is only 
applicable to stepwise-varying properties. This is because the idea behind tem- 
poral transition support is that when the “current” value of a temporal property 
is modified, the new current value assigned to it should remain constant until 
the next modification. Such a characteristic is inherent to stepwise properties,
and a fortiori, to transaction-time properties since they evolve in a stepwise manner.

Algorithm: Object conversion operator
Input and procedure variables:
  modified_classes: set of classes;        /* classes to which temporal support is added */
  modified_properties: set of properties;  /* properties to which temporal support is added */
  O: Object;                               /* object to be converted */
  O_copy: Object;                          /* variable used to store a copy of O */
Procedure:
  O_copy := O.copy();                      /* a copy of O is temporarily stored */
  Modify the structure of object O to fit the new schema;
  if classOf(O) in modified_classes then
    if (temporalDimension(classOf(O)) = transaction-time) then
      O.turn_on()
    else O.set_odomain([current_instant()..])
  for p in properties(classOf(O))
    if (p in modified_properties) {
      if (temporalDimension(p) = transaction-time) then
        O.p.turn_on();
      else if (semanticAssumption(p) = stepwise) then
        O.p.set_odomain([current_instant()..])
      O.p.set_value(O_copy.p.get_value())
    }
    else O.p = O_copy.p

² To avoid the complexity of this process on large databases, deferred conversion
techniques may be applied. However, this issue is out of the scope of this paper.
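
The conversion operator can be transcribed into a small executable sketch. Everything about the representation is an assumption made for illustration: objects and schema metadata are plain dictionaries, the interval [now..] is encoded as the tuple `("from", now)`, the current instant is passed as a parameter, and a valid-time property that is not stepwise raises an error because the text restricts temporal transition support to stepwise properties.

```python
def convert_object(obj, schema, modified_classes, modified_properties, now):
    """Return a converted copy of obj after temporal support has been added.

    obj:    {"class": class name, "props": {property name: fleeting value}}
    schema: metadata dictionaries describing the *new* schema
    """
    cls = obj["class"]
    old = dict(obj["props"])                   # copy of the fleeting values
    new = {"class": cls, "props": {}}

    if cls in modified_classes:
        if schema["class_dimension"][cls] == "transaction-time":
            new["growth_on"] = True            # O.turn_on()
        else:
            new["observation_domain"] = ("from", now)

    for p in schema["properties"][cls]:
        if p not in modified_properties:
            new["props"][p] = old[p]           # fleeting property: plain copy
            continue
        instance = {}
        if schema["property_dimension"][p] == "transaction-time":
            instance["growth_on"] = True       # O.p.turn_on()
        elif schema["assumption"][p] == "stepwise":
            instance["observation_domain"] = ("from", now)
        else:
            raise ValueError(f"{p}: only stepwise properties can be converted")
        # set_value: the old current value becomes the value at `now`.
        instance["effective_history"] = {now: old[p]}
        new["props"][p] = instance
    return new


schema = {
    "properties": {"Product": ["trademark", "price"]},
    "class_dimension": {"Product": "valid-time"},
    "property_dimension": {"price": "valid-time"},
    "assumption": {"price": "stepwise"},
}
legacy = {"class": "Product", "props": {"trademark": "Acme", "price": 50.0}}
converted = convert_object(legacy, schema, {"Product"}, {"price"}, now=100)
```

After conversion, the fleeting property `trademark` is unchanged, while `price` carries a one-point effective history starting at the conversion instant.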



5.2 Access modes 

Usually, object-oriented programming and querying languages only provide one 
construct for accessing property instances, and one for updating them. For instance,
in C++, the only way to access the value of an attribute is through the
“dot” operator, whereas updating is performed through constructs of the form
o.p = v. The proposed temporal ODMG extension on the other hand, provides
several update and access primitives for each type of temporal property instance. 
The notion of access mode that we introduce in this section, establishes which 
updating or accessing operator on temporal properties is to be used depending 
on the application context. Two access modes are provided: 

- The snapshot mode: temporal property instances are snapshot-valued and 
their value is defined by the structural value of their history at the current 
instant given by the system clock. In addition, any reference to the extent 
name of a temporal class retrieves the observed extent at the current instant. 

- The temporal mode: temporal property instances are history-valued, and no 
filtering is performed when accessing a temporal class extent (i.e. all objects 
in the extent are retrieved). 

Concretely, in the snapshot mode, whenever a temporal property is accessed
either from a program or from a query, the value associated to this property is
retrieved through the get_value operator (see section 4). Similarly, updates in this
mode are handled by the set_value operator. In the temporal mode, get_history
and set_ehistory are used instead.

In fact, the notion of access mode simulates a view mechanism: applications 
using the snapshot mode access a “snapshot” view of the database, while those 
using the temporal mode access the “temporal” view. 
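
The dispatch performed by access modes can be sketched as follows. The names `AccessMode`, `StepwiseProperty`, `access` and `update` are hypothetical, instants are integers passed explicitly, and `get_history` is simplified to return the stored timestamped values without interpolating over an observation domain.

```python
from enum import Enum


class AccessMode(Enum):
    SNAPSHOT = "snapshot"
    TEMPORAL = "temporal"


class StepwiseProperty:
    """Minimal stepwise temporal property instance (simplified sketch)."""

    def __init__(self, history=()):
        self._history = dict(history)          # instant -> value

    def get_history(self):
        return dict(self._history)

    def set_ehistory(self, history):
        self._history = dict(history)

    def get_value(self, now):
        """Stepwise value at `now`: the latest value set at or before it."""
        past = [i for i in self._history if i <= now]
        return self._history[max(past)] if past else None

    def set_value(self, value, now):
        self._history[now] = value


def access(prop, now, mode=AccessMode.SNAPSHOT):
    """Snapshot mode (the default) sees a value; temporal mode a history."""
    if mode is AccessMode.SNAPSHOT:
        return prop.get_value(now)
    return prop.get_history()


def update(prop, new, now, mode=AccessMode.SNAPSHOT):
    if mode is AccessMode.SNAPSHOT:
        prop.set_value(new, now)
    else:
        prop.set_ehistory(new)


price = StepwiseProperty({1: 50, 5: 60})
```

A legacy program calling `access(price, now)` keeps seeing a single value, while a temporal application passing `AccessMode.TEMPORAL` sees the history, including updates made in snapshot mode.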

Different access modes may be attached to any two applications accessing 
the same database. As a result, the access mode is a parameter of each appli- 
cation session. The snapshot mode is the default. This design choice is crucial 
to ensure temporal transition support, since it entails that existing applications
are automatically classified as “snapshot” during the migration process. 

To illustrate the role of access modes from the querying viewpoint, consider 
a class Product whose extent is named TheProducts, and having two attributes 
trademark and price with types string and real respectively. Now suppose that 
at some time during the life of the application, the class Product as well as its 
attribute price are declared as temporal (trademark remains a fleeting property). 
Now consider the following OQL query: 

select struct(t: b.trademark, p: b.price) from TheProducts as b

In the temporal mode, this query has type bag<struct<t: string, p:
History<real>>> and retrieves for each product type ever sold, the history
of its prices. Conversely, if the snapshot mode is assumed, the query type is
bag<struct<t: string, p: real>> and it retrieves the trademarks of all currently
sold product types with their corresponding prices.

6 Conclusion and future work 

The main contributions of this paper include the precise formulation of require- 
ments related to legacy code migration toward temporal DBMS, and a proposal 
of an ODMG temporal extension integrating these requirements. 

The formulated requirements are upward compatibility and temporal transi- 
tion support. The former states that a database may be transparently migrated 
from a DBMS to a temporal extension of it. The latter allows legacy code to re- 
main usable when temporal support is added to some components of a snapshot 
database schema, and may therefore be seen as a schema evolution issue. These 
notions were first proposed in the relational framework in [1, 12]. 

The proposed ODMG temporal extension is based on the Tempos frame- 
work [6,4]. It includes a set of ADT modeling temporal values and histories as 
well as temporal extensions of the main ODMG components. Temporal transi- 
tion support is ensured by separating the notion of temporal property from that 
of history and by dividing applications into snapshot and temporal: a temporal 
property may have a historical value in the context of a temporal application 
and a snapshot value in a non-temporal one. Update primitives on temporal 
properties are defined in such a way that updates performed by temporal and 
non- temporal applications are mutually consistent. 

The proposal is being implemented on top of the O2 DBMS. Up to now, we
have implemented the temporal types as a library of classes. The OQL extension
on the other hand, has been implemented by means of a pre-processor. This
pre-processor recognizes specific statements for setting the access mode of a
query, thereby integrating the notion of bi-accessibility. Current efforts aim at
integrating this notion into each of the programming language bindings defined
by the ODMG standard (i.e. Smalltalk, C++ and Java bindings).

Having formulated temporal transition support as a schema evolution problem,
it is straightforward to identify more sophisticated, yet useful alternative
definitions of this notion. For instance, in our definition, only a single temporal
schema is managed at a time. This works fine when the process of converting
non-temporal data into temporal data is carried out in a single step. However,
more complex situations may arise. For instance, suppose that a non-temporal
schema S is modified into a temporal schema S', and that subsequently S' is
modified to add temporal support to some of its non-temporal components, yielding
schema S''. In our approach, two views are managed at the end of this process:
one with schema S and another with schema S''. Hence, applications developed
under schema S' are not supported. We believe that generalizing our approach
to handle this kind of situation is an interesting perspective.

Abstract. The management of different schema versions is required in
long-lived database systems to accomplish data structural changes and 
represent their history. Once a suitable data model for schema versioning 
support has been defined, appropriate extensions must also be introduced 
in the data definition and manipulation languages. Such an extension is 
aimed at making the versioning facilities available at user-interface level 
and is the basis for the development of advanced multi-schema appli- 
cations. In this paper we present extensions to the definition and ma- 
nipulation language of the standard object-oriented data model ODMG 
for a generalized schema versioning support. To this end, two versioning
modalities will be considered in a single powerful system: temporal
versioning and management of alternative design versions. As far as the
temporal components are concerned, the proposed extensions of ODL
and OQL will be consistent with the TSQL2 temporal query language.



1 Introduction 

Databases and software systems have complex structures which are likely to con- 
tinually undergo changes during their lifetime. This is certainly true for OODBs, 
which have been mainly developed to model highly dynamic application scenar- 
ios where not only the data, but also their structure (i.e. schema) is subject to 
change. The need for retaining data entered under any schema definition has led 
to the introduction of the schema version notion. Generally speaking, to manage 
and maintain versions of a schema means dealing with one of the possible repre- 
sentations of the structure of the modeled real world. The application realm of 
schema versions may range from the maintenance of legacy data (formatted ac- 
cording to past schemas) to the reuse of software components, from the planning 
of human activities against alternative scenarios to the management of complex 
design processes. 

In the object-oriented field, two kinds of schema versions have been con- 
sidered: alternative versions (branching approach) and temporal versions. The 
former is typical of design environments (CAD/CAM applications), whereas the 
latter is required to model histories of structural changes (more suitable for GIS 
and multimedia applications). The first model which integrates the two
versioning modalities, in order to enhance the expressive power and application
potentialities of a single OODBMS, is presented in [6]. The model is based on
an ODMG Release 2.0 [2] extension. The standardization infrastructure offered 
by the adoption of ODMG as a stepping stone is directed towards a high degree 
of portability and interoperability between systems and represents a wide en- 
dorsement of the object-oriented approach. The ODMG standard includes three 
major components: the Object Model, the Object Definition Language (ODL) and 
the Object Query Language (OQL). While [6] represents a desired extension at 
object-model level, the addition of generalized schema versioning support to the 
ODMG standard still needs an intervention at language level. Such an extension 
of ODL and OQL is the main purpose of this work. 

The paper is organized as follows: Section 2 presents a brief overview of the 
model supporting generalized schema versioning; in Section 3 we propose our 
extension to the ODMG languages to make all the model facilities accessible to 
users with an SQL-like interface; conclusions will be found in Section 4. 

2 Outlook of the underlying model 

The object data model we consider here is a general model for the manage- 
ment of versions which integrates the pure temporal schema versioning with the 
branching versioning approach [6]. It has been developed in the framework of a
comprehensive research project, in which the authors play an active part and to 
which [4-6] and the present work are all contributions. In this generalized model, 
the identification of a particular schema version relies on the use of a symbolic 
name (to denote a design alternative) and two time coordinates (to select a 
temporal version with respect to transaction and valid time [7]). Therefore, a 
multidimensional mechanism is needed to reference distinct versions also at
language level. Any version could be referenced through a bitemporal pertinence
and/or a user-defined name (label). A bitemporal pertinence is defined as a
disjoint union of rectangles, where each rectangle is the product of a
transaction-time and a valid-time interval (in accordance with the BCDM model
[8]). All versions having the same symbolic label share a common property since,
for example, they belong to the same consolidated version in engineering
activities or the same scenario in GIS. At model level, a database can be
represented through a directed acyclic graph (DAG), corresponding to the version
derivation hierarchy used for CAD/CAM applications. Versions having the same
label belong to the
same node in the database graph, whereas the relationships between different 
nodes (e.g. the derivation of a new node or the merger of one node into another) 
are modelled through the graph edges. All the DAG elements are timestamped 
with transaction time in order to keep track of all the schema changes effected 
in the system. 
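
The identification scheme described above can be sketched in Python. This is only an illustration under stated assumptions, not the data structures of [6]: times are plain numbers, intervals are taken as closed-open, and the names Rect, SchemaVersion and select are hypothetical.

```python
from dataclasses import dataclass

FOREVER = float("inf")  # assumption: stands in for the open upper bound


@dataclass(frozen=True)
class Rect:
    """Product of a transaction-time interval and a valid-time interval."""
    tt_from: float
    tt_to: float
    vt_from: float
    vt_to: float

    def contains(self, tt: float, vt: float) -> bool:
        # Closed-open intervals are an assumption of this sketch.
        return self.tt_from <= tt < self.tt_to and self.vt_from <= vt < self.vt_to


@dataclass(frozen=True)
class SchemaVersion:
    label: str          # symbolic name of the design alternative (DAG node)
    pertinence: tuple   # disjoint union of Rect objects
    classes: frozenset  # class names, standing in for the full schema


def select(versions, label, tt, vt):
    """Return the versions of `label` whose pertinence covers the point (tt, vt)."""
    return [v for v in versions
            if v.label == label and any(r.contains(tt, vt) for r in v.pertinence)]


versions = [
    SchemaVersion("draft", (Rect(10, FOREVER, 1940, FOREVER),), frozenset({"Aircraft"})),
    SchemaVersion("cockpit", (Rect(20, FOREVER, 1998, FOREVER),), frozenset({"Aircraft"})),
]
hits = select(versions, "draft", tt=25, vt=1999)
```

A version is thus addressed by three coordinates: a label and a point in the transaction-time/valid-time plane.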

Schema changes are supported by means of a collection of primitive algebraic 
operations. They are grouped into two sets on the basis of the DAG element on 
which they operate. The set of “schema changes on node” primitives includes 
a complete set of operations acting on the elements of the object-oriented data 
model supported, such as attributes and classes. The set of “schema changes
on edge” primitives provides support for the integration of characteristics of a
schema version with schema versions belonging to other nodes. For example,
the primitive to merge versions (Merge Version in [6]) creates a new schema
version by merging schema versions (and their corresponding extant data)
belonging to different nodes. A listing of all the primitives supported by the model
and a definition of their formal semantics can be found in [6]. All the schema 
modification statements considered in this paper can always be implemented by 
means of such primitives, although we will not show details here, for the sake of 
brevity. 

3 Extensions to the ODMG languages 

The proposed ODMG language extensions follow some basic principles: 

— they are SQL-compatible as OQL is very close to SQL-92;

— the ODL extensions support a complete set of primitive schema changes. 
In particular, they support all semantic constructs for the general schema 
versioning mechanism presented in [6];

— the versioning granularity is the schema; 

— an internal approach to schema changes is adopted; 

— they are TSQL2-compatible [11] as far as temporal parts are concerned. 

In the object-oriented field, an important aspect which involves the schema ver- 
sioning is the choice of the level at which versioning is supported. The alternative 
is between the single class and the entire schema. We adopt the schema as ver- 
sioning granularity since it automatically provides a complete view of the set of 
class versions tied together in a schema version. In this way, it makes it easier 
to check the inter-class consistency and the management of queries on objects 
belonging to different class versions. 

Following the internal approach to schema changes, a complete new version
cannot be defined from scratch when the schema undergoes a change: the only way
to obtain one is to apply a sequence of primitive schema changes
to an already existing version. In this way, we provide the system with default
semantics for automatic database conversion and an automatic consistency
checking associated with each schema change primitive [4]. A schema change
primitive is a non-decomposable operation acting on the schema. The support 
of a complete set of primitive changes allows the execution of any possible schema 
update. In fact, complex schema updates can be effected via sequences of schema 
change primitives. 
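
As a minimal sketch of the internal approach, the following Python fragment models two hypothetical primitives (add_class and add_attribute are illustrative names, not the primitives of [6]). Each primitive performs its own check and returns a new schema instead of updating in place; a schema is assumed to be a plain dictionary mapping class names to attribute dictionaries.

```python
def add_class(schema: dict, name: str) -> dict:
    """Primitive: return a new schema with an empty class `name` added."""
    if name in schema:
        raise ValueError(f"class {name} already exists")
    return {**schema, name: {}}


def add_attribute(schema: dict, cls: str, attr: str, typ: str) -> dict:
    """Primitive: return a new schema with attribute `attr` added to `cls`."""
    if cls not in schema:
        raise ValueError(f"unknown class {cls}")
    if attr in schema[cls]:
        raise ValueError(f"attribute {attr} already exists in {cls}")
    return {**schema, cls: {**schema[cls], attr: typ}}


def apply_changes(schema: dict, changes) -> list:
    """Apply a sequence of primitives; every step yields a new version."""
    history = [schema]
    for primitive, args in changes:
        history.append(primitive(history[-1], *args))
    return history


v0 = {"Aircraft": {"name": "string"}}
history = apply_changes(v0, [
    (add_class, ("Engine",)),
    (add_attribute, ("Aircraft", "engines", "set<Engine>")),
])
```

The complex update is expressed purely as a sequence of primitives, and the starting version v0 is left untouched.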



3.1 Schema selection 

The schema version selection is achieved through the SET SCHEMA statement,
included as an OQL extension. It is basically the same statement introduced for
TSQL2 [3, 11], augmented with a LABEL clause. Its complete form is:

SET SCHEMA <schema selection condition>

<schema selection condition> ::=
    LABEL <label>
    AND VALID <datetime value expression>
    AND TRANSACTION <datetime value expression>

It allows default values to be set for label, valid and transaction time to be 
employed by subsequent statements. The value set is used as a default context 
until a new SET SCHEMA is executed or an end-of-transaction command is issued. 
Notice that, although the valid-time and label defaults are used to select schema 
versions for the execution of any statement, the transaction-time default is used 
only for retrievals. Owing to the definition of transaction time, only current 
schema versions (with transaction time equal to now) may undergo changes. 
Therefore, for the execution of schema updates, the transaction time specified
in the TRANSACTION clause of the SET SCHEMA statement is simply ignored, and
the current transaction time now (see Subsection 3.5 for more details) is used
instead. Moreover, one or two of the selection conditions may be omitted; in
this case too, the preexisting defaults are used.

In general, the SET SCHEMA statement could “hit” more than one schema 
version, for example when intervals of valid- or transaction-time are specified. 
To solve this problem we distinguish between two cases: 

— for schema modifications we require that only one schema version is selected, 
otherwise the statement is rejected and the transaction aborted; 

— for retrievals several schema versions can qualify for temporal selection at 
the same time. In this case, retrievals can be based on a completed schema 
[10], derived from all the selected schema versions. 
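
The two cases can be illustrated with a small Python sketch. It is an assumption-laden toy: intervals are closed-open pairs of numbers, a schema version is reduced to a set of attribute names, and the completed schema of [10] is approximated by a plain union, which is only a rough stand-in. The names resolve and SchemaSelectionError are hypothetical.

```python
class SchemaSelectionError(Exception):
    """Raised when a schema modification does not select exactly one version."""


def overlaps(a, b):
    """True if the closed-open intervals a=(from, to) and b=(from, to) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def resolve(versions, valid, for_update):
    """versions: list of (valid_interval, attributes) pairs for one label."""
    hits = [attrs for interval, attrs in versions if overlaps(interval, valid)]
    if for_update:
        # Modifications must hit exactly one version, otherwise they are rejected.
        if len(hits) != 1:
            raise SchemaSelectionError(f"{len(hits)} schema versions selected")
        return hits[0]
    # Retrievals may hit several versions; a union approximates the completed schema.
    completed = set()
    for attrs in hits:
        completed |= attrs
    return completed


versions = [((1940, 1990), {"name"}), ((1990, 2100), {"name", "engines"})]
retrieval_schema = resolve(versions, (1980, 2000), for_update=False)
```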

Further details on multi-schema query processing can be found, for instance, 
in [3, 9]. The scope of a SET SCHEMA statement is, in any case, the transaction in 
which it is executed. Therefore, for transactions not containing any SET SCHEMA 
command, a global default should be defined for the database. As far as transac- 
tion and valid time are concerned, the current and present schema is implicitly 
assumed, whereas a global default label could be defined by the database ad- 
ministrator by means of the following command: 

SET CURRENT_LABEL <label> 

Obviously, this definition is overridden by explicit specification in SET SCHEMA 
statements found in transactions. 
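
The default-context behaviour of SET SCHEMA can be sketched as follows. This is a hypothetical Python model, not an implementation of the statement: SchemaContext, set_schema and coordinates are illustrative names, and time values are kept as uninterpreted strings.

```python
from dataclasses import dataclass, replace

NOW = "now"  # assumption: symbolic placeholder for the current time


@dataclass(frozen=True)
class SchemaContext:
    label: str
    valid: str = NOW
    transaction: str = NOW


def set_schema(ctx, label=None, valid=None, transaction=None):
    """Omitted selection conditions keep the preexisting defaults."""
    return replace(
        ctx,
        label=ctx.label if label is None else label,
        valid=ctx.valid if valid is None else valid,
        transaction=ctx.transaction if transaction is None else transaction,
    )


def coordinates(ctx, for_update):
    """Schema updates ignore the TRANSACTION default and always use now."""
    tt = NOW if for_update else ctx.transaction
    return (ctx.label, ctx.valid, tt)


global_default = SchemaContext(label="draft")   # as set by SET CURRENT_LABEL
ctx = set_schema(global_default, label="engine", transaction="1998-06-01")
ctx = set_schema(ctx, valid="1999-01-01")       # label and transaction are kept
```

A retrieval issued in this context would use all three defaults, whereas a schema update would replace the transaction-time coordinate by now.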



3.2 Extensions to the ODL for direct DAG manipulation 

ODMG includes standard languages (ODL and OQL) for object-oriented databases.
Using ODL, only one schema, the initial one, can be defined per database.
The definition of the schema consists of specifications concerning model
elements, like interfaces, classes and so on. The main objective of our ODMG
language extensions is to enable full schema development and maintenance by
introducing the concept of schema version at language level.

Node creation statement 

A schema version can only be explicitly introduced through the CREATE SCHEMA 
statement added to ODL. It requires a new label associated with the new schema 
version. In this way, the explicit creation of a schema version always coincides 
with the addition of a new node to the DAG. Afterwards, when the schema 
undergoes changes, any new schema version in the same node may only be the 
outcome of a sequence of schema changes applied to versions of the node, since 
we follow the internal change principle. The CREATE SCHEMA syntax is: 

CREATE SCHEMA <label>
    <schema def>
    [<schema change validity>]

<schema def> ::=
    <specification> | FROM SCHEMA <schema selection condition>

<schema change validity> ::=
    VALID <datetime value expression>

The CREATE SCHEMA statement provides for two creation options: 

- the new schema version can be created from scratch, as specified in the 
<specification> part. In this case the new node is isolated; 

— the new schema version can be the copy of the current schema version se- 
lected via the <schema selection condition>. In this case, the node with 
the label specified in the FROM SCHEMA clause becomes the father of the new 
node with the label specified in the CREATE SCHEMA clause. 

The optional <schema change validity> clause is introduced in order to specify
the validity of the schema change, enabling retro- and pro-active changes [4].
The new schema version is assigned the “version coordinates” <label>, [now, ∞]
and validity <datetime value expression>. When the <schema change validity>
clause is not specified, the validity is assumed to be [−∞, ∞]. The node creation
is also recorded in the database DAG by means of a transaction timestamp
[now, ∞], which is associated with the node label.

Let us consider as an application example the activity of an engineering 
company interested in designing an aircraft. As a first step, an initial aircraft 
structure (schema) is drawn, whose instances are aircraft objects. This can be 
done by the introduction of a new node, named “draft”, that includes one schema
version (valid from 1940 on) containing the class “Aircraft”, as follows: 

CREATE SCHEMA draft
    class Aircraft {
        attribute string name; };
    VALID PERIOD '[1940-01-01 - forever]'

Fig. 1. Outcomes of the application of CREATE SCHEMA statements



The left side of Figure 1 shows the schema state after the execution of the above 
statement. In a second step, the design process can be entrusted to independent 
teams to separately develop the engines and the cockpit. For instance, the team 
working on the cockpit can use a new node, named “cockpit”, derived from the
node “draft” by means of the statement (the outcome is shown on the right side
of Figure 1):

CREATE SCHEMA cockpit
    FROM SCHEMA LABEL draft AND VALID AT DATE 'now'
    VALID PERIOD '[1998-01-01 - forever]'

The new node contains a schema version which is a copy of the current schema
version belonging to the node “draft” and valid at “now” (SV1). The temporal
pertinence to be assigned to the new schema version is [now, ∞] × [1998/01/01, ∞].
In a similar way, a new node “engine” can be derived from the node “draft”.
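
The effect of the two CREATE SCHEMA options on the DAG can be sketched in Python. The class below is a hypothetical toy, not the model of [6]: time values are plain numbers, and the FROM SCHEMA selection is simplified to "the most recently created version of the source label".

```python
import copy
import math


class VersionDAG:
    """Toy database DAG: nodes are labels, edges link a father to a derived node."""

    def __init__(self):
        self.nodes = {}      # label -> transaction-time interval of the node
        self.edges = []      # (father, child, transaction-time interval)
        self.versions = []   # (label, transaction time, valid time, schema)

    def latest(self, label):
        """Schema of the most recently created version of `label` (simplified)."""
        for lab, _tt, _vt, schema in reversed(self.versions):
            if lab == label:
                return schema
        raise KeyError(label)

    def create_schema(self, label, now, schema=None, from_label=None,
                      valid=(-math.inf, math.inf)):
        if label in self.nodes:
            raise ValueError(f"label {label!r} already in use")
        if (schema is None) == (from_label is None):
            raise ValueError("give either a specification or a source label")
        if from_label is not None:
            # FROM SCHEMA: copy the source version and record the father edge.
            schema = copy.deepcopy(self.latest(from_label))
            self.edges.append((from_label, label, (now, math.inf)))
        self.nodes[label] = (now, math.inf)
        self.versions.append((label, (now, math.inf), valid, schema))


dag = VersionDAG()
dag.create_schema("draft", now=1, schema={"Aircraft": {"name": "string"}},
                  valid=(1940, math.inf))
dag.create_schema("cockpit", now=2, from_label="draft", valid=(1998, math.inf))
```

After the second call the node “cockpit” holds a copy of the “draft” schema and the DAG contains one edge from “draft” to “cockpit”; a node created from a specification stays isolated.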

Node deletion statement 

The deletion of a node is accomplished through the DROP NODE statement. Notice 
that the deletion implies the removal of the node from the current database. This 
is effected by setting to “now” the end of the transaction-time pertinence of all 
the schema versions belonging to that node and of the node itself in the DAG. 
The syntax of the DROP NODE statement is simply: 

DROP NODE <label> 

The DROP NODE statement corresponds to a primitive schema change which is
only devoted to the deletion of nodes. Thus, the node to be deleted has to be
isolated. The isolation of a node is accomplished by making the required DROP
EDGE statements precede the DROP NODE statement.

Edge manipulation statements

Edges between nodes can be explicitly added or removed by applying the CREATE
EDGE or DROP EDGE specifications, respectively. The corresponding syntax is:

CREATE EDGE FROM <label> TO <label>

DROP EDGE FROM <label> TO <label>

The CREATE EDGE statement adds a new edge to the DAG with transaction-time
pertinence equal to [now, ∞]. The DROP EDGE statement removes the edge from
the current database DAG by setting its transaction-time endpoint to now.
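
The interplay between DROP EDGE and DROP NODE can be sketched as follows. The Dag class is a hypothetical illustration in which deletions are logical: they only close the transaction-time interval, so the history of the DAG is preserved. Time values are plain numbers by assumption.

```python
import math


class Dag:
    def __init__(self):
        self.nodes = {}   # label -> [tt_from, tt_to]
        self.edges = {}   # (source, target) -> [tt_from, tt_to]

    def create_node(self, label, now):
        self.nodes[label] = [now, math.inf]

    def create_edge(self, src, dst, now):
        self.edges[(src, dst)] = [now, math.inf]

    def drop_edge(self, src, dst, now):
        self.edges[(src, dst)][1] = now   # logical deletion: close the interval

    def is_isolated(self, label):
        """A node is isolated when no current edge starts from or reaches it."""
        return not any(label in pair and tt[1] == math.inf
                       for pair, tt in self.edges.items())

    def drop_node(self, label, now):
        if not self.is_isolated(label):
            raise ValueError(f"node {label!r} is not isolated; drop its edges first")
        self.nodes[label][1] = now


dag = Dag()
dag.create_node("draft", now=1)
dag.create_node("cockpit", now=1)
dag.create_edge("draft", "cockpit", now=2)
dag.drop_edge("draft", "cockpit", now=3)   # must precede the node deletion
dag.drop_node("cockpit", now=4)
```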

3.3 Extension to the ODL for schema version modification 

Schema version modifications are handled by two collections of statements. The
former acts on the elements of the ODMG Object Model (attribute, relationship,
operation, exception, hierarchy, class and interface), whereas the latter integrates
existing characteristics of schema versions into other schema versions, also
intervening on the DAG. None of the supported statements performs an update
in place: each of them always generates a new schema version. All
these operations act on the current schema version selected via the SET SCHEMA
statement (see Subsection 3.1).

The first collection includes a complete set of operations [1] handled by the 
CREATE, DROP and ALTER commands modified to accommodate the extensions: 

CREATE <element specification> [<schema change validity>] 

DROP <element name> [<schema change validity>] 

ALTER <element to alter> [<schema change validity>] 

The main extension concerns the possibility of specifying the validity of the new
schema version (the full syntax of unexpanded non-terminals can be found in
Appendix A). The outcome of the application of any of these schema changes
is a new schema version with the same label as the affected schema version and
the validity specified if the <schema change validity> clause exists, [−∞, ∞]
otherwise. Suppose that, in the engineering company example, the team working
on the “engine” part is interested in adding a new class called “Engine” and an 
attribute in the class “Aircraft” to reference the new class. This can be done 
by means of the following statements (the outcome is shown on the left side of 
Figure 2): 

SET SCHEMA LABEL engine
    AND VALID AT DATE '1999-01-01';

CREATE CLASS Engine VALID PERIOD '[1940-01-01 - forever]';

CREATE ATTRIBUTE set<Engine> engines IN Aircraft
    VALID PERIOD '[1940-01-01 - forever]';

The second collection includes the following statements (the full syntax can
be found in Appendix A):

ADD <element to add> FROM SCHEMA <schema selection condition>
    [<schema change validity>]

MERGE FROM SCHEMA <schema selection condition>
    [<schema change validity>]

Fig. 2. Outcomes of the application of modification statements

The ADD statements can be used for the integration of populated elements, like
attributes or relationships, in the affected schema version, whereas the MERGE
statement can be used for the merging of two entire schema versions. They
originate from the CAD/CAM field, where they are necessary for a user-driven
design version control. They implicitly operate on the DAG by binding the
involved nodes by means of edges. Both statements require two schema versions:
the affected schema version, selected via the SET SCHEMA statement, and the
source schema version, selected via the <schema selection condition> clause,
from which the required characteristics are extracted or which is to be
integrated. Notice that the main difference between the use of the ADD and the
CREATE statements is that the former considers populated elements (with values
inherited) belonging to the source schema version, while the latter simply adds
new elements with a default value. In our airplane design example, two
consolidated schema versions from the nodes “engine” and “cockpit” can be merged
to give birth to a new node called “aircraft”, which represents the final state
of the design process:

CREATE SCHEMA aircraft
    FROM SCHEMA LABEL engine AND VALID AT DATE '1990/01/01'
    VALID PERIOD '[1950-01-01 - forever]';

SET SCHEMA LABEL aircraft
    AND VALID AT DATE '1999-01-01';

MERGE FROM SCHEMA LABEL cockpit
    AND VALID IN PERIOD '[2010]'
    VALID PERIOD '[1950/01/01 - forever]';

The outcome is shown on the right side of Figure 2. The first statement creates
a new node called “aircraft” by making a copy of the current schema version
belonging to the “engine” node valid at '1990/01/01'. Then, the SET SCHEMA
statement sets the default values for label and valid time. The last statement
integrates in the selected schema version the schema version belonging to the
“cockpit” node valid in 2010. The temporal pertinence to be assigned to the new
schema version is [1950/01/01, ∞].

3.4 Data manipulation operations 

Data manipulation statements (retrieval and modification operations) can use 
different schema versions if preceded by appropriate SET SCHEMA instructions. 
However, we propose that single statements can also operate on different nodes 
at the same time. To this purpose, labels can also be used as prefixes of path 
expressions. When one path expression starts with a label of a node, such a node 
is used as a context for the evaluation of the rest of the path expression. For 
instance, the following statement: 

SELECT d.name
FROM draft.Aircraft d, cockpit.Aircraft c
WHERE d.name = c.name

retrieves all the aircraft names defined in the initial design version labelled
“draft”, which are still included in the successive “cockpit” design version.
The expression draft.Aircraft denotes the aircraft class in the node labelled
“draft”, while the expression cockpit.Aircraft denotes a class with the same
name in the node “cockpit”. When the label specifier is omitted in a path
expression, the default label set by the latest SET SCHEMA (or the global default)
is assumed.
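
The label-prefix rule can be sketched in Python as follows. This is an illustration only: the nodes dictionary, its sample objects and the helper names resolve_path and extent are hypothetical, and the sketch assumes that no class shares its name with a node label.

```python
def resolve_path(path, nodes, default_label):
    """Split an optional leading node label off a path expression."""
    head, _, rest = path.partition(".")
    if rest and head in nodes:
        return head, rest          # explicit label: use that node as context
    return default_label, path     # no label: fall back to the default


# Assumed sample data: class extents per node.
nodes = {
    "draft": {"Aircraft": [{"name": "A1"}, {"name": "A2"}]},
    "cockpit": {"Aircraft": [{"name": "A2"}, {"name": "A3"}]},
}


def extent(path, default_label="draft"):
    label, cls = resolve_path(path, nodes, default_label)
    return nodes[label][cls]


# Counterpart of the SELECT statement above, written as a nested loop join.
names = [d["name"]
         for d in extent("draft.Aircraft")
         for c in extent("cockpit.Aircraft")
         if d["name"] == c["name"]]
```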

3.5 Transactions and schema changes 

Transactions involving schema changes may also involve data manipulation oper- 
ations (e.g. to overcome default mechanisms for propagating changes to objects). 
Therefore, schema modification and data manipulation statements can be inter- 
leaved according to user needs between the BEGIN TRANSACTION statement and 
the COMMIT or ROLLBACK statements. The main issues concerning the interaction 
between transactions and schema modifications are: 

— the semantics of now, that is, which are the current schema versions that
can undergo changes and what is the transaction-time pertinence of the new
schema versions of the committed transactions;

— how schema modifications are handled inside transactions, in particular on
which state the statements that follow a schema modification operate.

A typical transaction has the following form:

BEGIN TRANSACTION
    SET SCHEMA LABEL t
        AND VALID ssv
        AND TRANSACTION sst;

COMMIT WORK

We assume that the transaction time Ut assigned to a transaction T is the
beginning time of the transaction itself. Due to the atomicity property, the whole
transaction, if committed, can be considered as instantly executed at Ut.
Therefore, the now value can also always be considered equal to Ut while the
transaction is executing. Since “now” in transaction time and “now” in valid time
coincide, the same time value Ut can also be used as the current (constant)
value of “now” with respect to valid time during the transaction processing. In
this way, any statement referring to “now”-relative valid times within a
transaction can be processed consistently. Notice that, if we had chosen as Ut the
commit time of T, as other authors propose [7], the value of “now” would have been
unknown for the whole transaction duration and, thus, how to process statements
referencing “now” in valid time during the transaction would have been
a serious problem. Moreover, since now = Ut is the beginning of transaction
time, the new schema versions which are outcomes of any schema modification
are associated with a temporal pertinence which starts at Ut.
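
A minimal sketch of this choice, under the assumption of a logical clock that returns increasing integers, is given below. The Transaction class and its methods are hypothetical names; the point is only that now is read once, when the transaction begins, and reused by every later statement.

```python
import itertools
import math


class Transaction:
    def __init__(self, clock):
        self._clock = clock
        self.now = clock()            # read once, at BEGIN TRANSACTION
        self.new_versions = []

    def valid_time(self, expr):
        """Resolve a "now"-relative valid time against the frozen value."""
        return self.now if expr == "now" else expr

    def modify_schema(self, label, valid_from):
        self._clock()                 # wall-clock time advances, `now` does not
        pertinence = ((self.now, math.inf),
                      (self.valid_time(valid_from), math.inf))
        self.new_versions.append((label, pertinence))


ticks = itertools.count(100)          # assumed logical clock
tx = Transaction(lambda: next(ticks))
tx.modify_schema("engine", 1940)
tx.modify_schema("engine", "now")
```

Both new versions receive a transaction-time pertinence starting at the same instant, although the clock has advanced between the two statements.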

When schema and data modification operations are interleaved in a trans- 
action, data manipulation operations have to operate on intermediate database 
states generated by the transaction execution. These states may contain schemas 
which are incrementally modified by the DDL statements and which may also 
be inconsistent. A correct and consistent outcome is, thus, the developer’s re- 
sponsibility. The global consistency of the final state reached by the transaction 
will be automatically checked by the system at commit time; in case consistency 
violations are detected, the transaction will be aborted. 
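
The commit-time check can be illustrated with the following Python sketch. It rests on strong simplifications: a schema is a dictionary of classes, the only consistency rule is that every attribute type must be a built-in type or a class of the schema, and all names (run_transaction, is_consistent, the two sample statements) are hypothetical.

```python
BUILTIN_TYPES = {"string", "integer"}


def is_consistent(schema):
    """Every attribute type must be a built-in type or a class of the schema."""
    return all(t in BUILTIN_TYPES or t in schema
               for attrs in schema.values() for t in attrs.values())


def run_transaction(schema, statements):
    work = {cls: dict(attrs) for cls, attrs in schema.items()}  # private copy
    for statement in statements:
        statement(work)               # intermediate states may be inconsistent
    if is_consistent(work):           # global check at commit time only
        return work, "committed"
    return schema, "aborted"          # the original state is kept


def add_engine_attribute(s):
    s["Aircraft"]["engine"] = "Engine"


def add_engine_class(s):
    s["Engine"] = {}


base = {"Aircraft": {"name": "string"}}
ok_state, ok_status = run_transaction(base, [add_engine_attribute, add_engine_class])
bad_state, bad_status = run_transaction(base, [add_engine_attribute])
```

In the first transaction the state after the first statement is inconsistent (the class Engine does not exist yet), but the final state passes the check; the second transaction is aborted.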

4 Conclusions and Future work 

We have proposed extensions to the ODMG data definition and manipulation 
languages for generalized schema versioning support. This completes the ODMG 
extension initiated in [4-6] with the work on the data model. The proposed 
extensions are compatible with SQL-92 and TSQL2, as much as possible, and 
include all the primitive schema change operations considered for the model 
and, in general, in OODB literature. Future work will consider the formalization 
of the semantics of the proposed language extensions, based on the algebraic 
operations proposed in [6]. Further extensions could also consider the addition
of other schema versioning dimensions of potential interest, like spatial ones [9].

A Syntax details

The non-terminal elements which are not expanded in this Appendix can be
found in the BNF specification of ODMG ODL [2] and TSQL2 [11].

<element specification> ::=
      <element specification in interface> IN <interface name>
    | <interface specification>

<interface specification> ::=
      INTERFACE <interface name>
    | CLASS <class name>

<element specification in interface> ::=
      ATTRIBUTE <domain type> <attribute name> [<fixed array size>]
    | RELATIONSHIP <target of path> <relationship name>
          INVERSE <interface name>::<relationship name>
    | OPERATION <op type spec> <operation name> <parameter dcls>
          [RAISES(<scoped name list>)] CODE <code spec>
    | EXCEPTION <exception name> TO OPERATION <operation name>
    | EXCEPTION <exception name> {[<member list>]}
    | SUPERINTERFACE <interface name>

<element name> ::=
      <element name in interface> IN <interface name>
    | <interface specification>

<element name in interface> ::=
      ATTRIBUTE <attribute name>
    | RELATIONSHIP <relationship name>
    | OPERATION <operation name>
    | EXCEPTION <exception name> TO OPERATION <operation name>
    | EXCEPTION <exception name>
    | SUPERINTERFACE <interface name>

<element to alter> ::=
      <element to alter in interface> IN <interface name>
    | INTERFACE NAME <interface name> INTO <interface name>
    | CLASS NAME <class name> INTO <class name>

<element to alter in interface> ::=
      ATTRIBUTE NAME <attribute name> INTO <attribute name>
    | ATTRIBUTE TYPE <attribute name> INTO <domain type>
    | RELATIONSHIP NAME <relationship name> INTO <relationship name>
    | RELATIONSHIP TYPE <relationship name> INTO <target of path>
    | INVERSE TYPE <relationship name> INTO <relationship name>
    | OPERATION NAME <operation name> INTO <operation name>
    | OPERATION CODE <operation name> INTO INPUT <op type spec>
          OUTPUT <parameter dcls> CODE <code spec>
    | EXCEPTION NAME <exception name> INTO <exception name>
    | EXCEPTION TYPE <exception name> INTO {[<member list>]}

<element to add> ::=
      <element to add in interface> IN <interface name>
    | <interface specification>

<element to add in interface> ::=
      ATTRIBUTE <attribute name>
    | RELATIONSHIP <relationship name>
    | OPERATION <operation name>
    | EXCEPTION <exception name>

Abstract. Modifying the schema of a populated database is an expensive operation.
We propose to use the non-classical transposed storage of an object database.
The transposed storage avoids database reorganization and reduces the number of
input/output operations in the context of schema evolution. Thus schema changes
are no longer costly operations. Consequently, immediate and physical
propagation of schema changes can be supported. We extend the OO7 benchmark with
schema evolution operations and submit our F2 DBMS to this benchmark. The
obtained results demonstrate the feasibility and performance of our approach.



1 Introduction 

Schema evolution is an essential feature of a database management system (DBMS) to 
allow database applications to run in a dynamic environment. Modifying the schema of 
a populated database is a difficult problem and a very expensive operation [15]. To sup- 
port schema evolution, a DBMS should address the following main issues: (1) set of 
schema changes, (2) database consistency, (3) information preservation, (4) semantics 
of schema changes, (5) data copy due to inheritance, (6) database reorganization, (7) im- 
mediate versus deferred propagation, (8) physical versus logical propagation, (9) input/ 
output operations. We discussed the first five issues in [3] [1]. In [3] we presented the 
schema changes in the F2 DBMS (see fig. 1) which are supported thanks to the uniform- 
ity of objects in F2 and a trigger mechanism. In [1] we proposed the multiobject mech- 
anism in F2 to make schema changes more pertinent, easier to implement and less 
expensive than with the classical implementation of specialization. In this paper we fo- 
cus on the last four issues of schema evolution within object-oriented database systems. 

Database reorganization. Updates to the schema must propagate to existing ob- 
jects - since databases are supposed to be populated with objects - in order to keep the 
database in a consistent state. In the classical object storage, this results in database re- 
organization, i.e. data copying. For example, in Gemstone [15], when an attribute is de- 
leted from a class, all the instances of this class are re-written. Database reorganization 
should be avoided since it is extremely time consuming. 







- Create a new class
- Delete an existing class
- Update an existing class:
  modify its description: change its name, its interval if atomic class,
  its maximal length if atomic string class;
  modify its position in the class hierarchy: change its superclass, make it
  a subclass / non-subclass, i.e. attach / detach it from a specialization tree
- Create a new attribute of a class
- Delete an existing attribute
- Update an existing attribute:
  change its name, its maximal cardinality, its minimal cardinality, its
  domain class, its origin class
- Create a new key of a class
- Delete an existing key
- Update an existing key:
  change the class on which it is defined, its attributes, enable / disable it
- Create a new specialization constraint
- Delete an existing specialization constraint
- Update an existing specialization constraint:
  change its name, the list of subclasses on which it is defined
- Create a new trigger
- Delete an existing trigger
- Update an existing trigger:
  change the event for which it is defined, change the list of methods it
  triggers
- Create a new event
- Delete an existing event
- Update an existing event:
  change the class on which it is defined, its kind, its attribute
- Create a new method
- Delete an existing method
- Update an existing method:
  change its name

Fig. 1. Schema changes in F2



Immediate versus deferred propagation. A major consideration is when to bring 
the database to a consistent state with respect to the new schema [15] [12]. There are 
two main approaches: immediate and deferred. In the immediate approach, all objects 
of the database are updated as soon as the schema modification is performed, whereas 
with the deferred approach objects are updated only when they are actually used [12]. 
The two approaches offer the choice of “pay me now or pay me later” [15]. In the im- 
mediate approach, much time can be consumed when a class is modified. In the deferred 
approach, the response time of queries increases and a permanent propagation mechanism
is required throughout the system’s lifetime. Gemstone [15] supports immediate
propagation, Orion [5] supports deferred propagation, and O2 [12] supports both.
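
The “pay me now or pay me later” trade-off can be made concrete with a minimal sketch. The code below does not reproduce Gemstone, Orion or O2; the class Store, its list of converters and the add_email schema change are hypothetical names introduced only for illustration. It counts how many object conversions each strategy performs.

```python
class Store:
    """Toy object store illustrating immediate vs. deferred propagation."""

    def __init__(self, deferred):
        self.deferred = deferred
        self.version = 0       # current schema version
        self.converters = []   # converters[i] upgrades an object from version i to i + 1
        self.objects = {}      # oid -> (schema version of the stored object, attribute dict)
        self.conversions = 0   # number of object conversions performed so far

    def insert(self, oid, values):
        self.objects[oid] = (self.version, dict(values))

    def change_schema(self, converter):
        self.converters.append(converter)
        self.version += 1
        if not self.deferred:
            # Immediate: pay now, convert every stored object.
            for oid in list(self.objects):
                self._upgrade(oid)

    def _upgrade(self, oid):
        version, values = self.objects[oid]
        while version < self.version:
            values = self.converters[version](values)
            version += 1
            self.conversions += 1
        self.objects[oid] = (version, values)

    def read(self, oid):
        # Deferred: pay later, convert an object the first time it is used.
        self._upgrade(oid)
        return self.objects[oid][1]


def add_email(values):
    """Hypothetical schema change: add an attribute with a default value."""
    return {**values, "email": None}


immediate, deferred = Store(deferred=False), Store(deferred=True)
for store in (immediate, deferred):
    for oid in range(1000):
        store.insert(oid, {"name": f"p{oid}"})
    store.change_schema(add_email)

print(immediate.conversions, deferred.conversions)
deferred.read(42)
print(deferred.conversions)
```

Under these assumptions the immediate store converts every object at the time of the schema change, while the deferred store converts an object only when it is first read, at the price of a version check on every access.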

Physical versus logical propagation. Propagation of schema changes to database 
objects can be performed either physically or logically. In the former case, the objects 
are actually changed in the database; this may require database reorganization. In the 
latter case, objects remain unchanged and are filtered during execution; this increases 
response time. Gemstone performs physical propagation (conversion) and so does Orion
(screening). Encore [18] performs logical propagation (class versioning with an error
handling mechanism). Another way to perform logical propagation is to use views
for schema evolution [4], but this is beyond the scope of this paper.

Input/output operations. In evaluating the performance of databases, input/output
(I/O) operation time typically dominates CPU operation time [13]. Since schema changes
propagate to database objects, the number of I/O operations may be large and should
be taken into account.

In this paper, we propose to use the transposed storage architecture, in which data
is partitioned into vectors. The transposed storage avoids database reorganization due to
schema evolution. It also reduces the number of I/O operations when updating the schema.
Consequently, schema changes are no longer costly operations. The problems of
immediate versus deferred and physical versus logical propagation are thus dissolved,
and the immediate and physical propagation of schema changes can be chosen.

The remainder of the paper is organized as follows. Section 2 presents the transposed
storage. Section 3 gives its advantages for schema evolution. Section 4 demonstrates
the feasibility and performance of our approach, which is implemented in the F2
object-oriented DBMS. We extend the OO7 benchmark [9] with schema evolution operations
and run it on F2. Section 5 mentions related approaches and Section 6 concludes
the paper.



2 Transposed Storage 

2.1 Object Format 

The idea of transposed storage is to vertically partition the objects of a class C according 
to each attribute of C. That is, the values on an attribute of all objects are grouped to- 
gether. If C has five attributes, then it is partitioned into five collections of attribute val- 
ues called vectors. An object of C has each of its attribute values in the corresponding 
vector at the same logical offset. 

In the classical storage, an object o of a class C having n attributes att_1, att_2, ...,
att_n is stored in a record containing its n attribute values. In the transposed storage, it is
spread over (n+1) vectors: each vector v_i contains its value on the attribute att_i and the
vector v_(n+1) contains its value on the state attribute state_C. The state attribute is managed
internally by the DBMS and is not visible to end-users. The value of o on state_C
indicates whether the object o belongs to the class C, and gives the number of objects
referencing o through attributes whose domain is C (it acts like a reference counter for
cascade deletions). The oid of o is a pair <R_C, r> where R_C is the class identifier and r
the instance identifier within C. All the attribute values of o are stored in the vectors at
the same logical offset r (instance identifier of o).

For example, the class Person has four attributes: name, birthdate, jobs and spouse. The
class Employee is a subclass of Person and has two attributes: emp# and salary. Figure 
2 shows the storage of an employee object (Paul) and a person object (Pauline). An ob- 
ject which belongs to a subclass belongs also to its ancestors. In our example Paul be- 
longs to Employee and Person (positive values on the corresponding state attributes). 
Since the domain class of the spouse attribute is Person, the values on this attribute are 
oids. In the spouse vector, only instance identifiers are stored (the class identifier is the 
same for all values and can be known by querying the dictionary; hence it is not stored). 
In the example, the employee named “Paul” is married to the musician “Pauline”. 
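
The layout described above can be sketched in a few lines. This is an illustration, not the F2 implementation: the class TransposedStore, the attribute name emp_no (standing for emp#), the birthdates, jobs and salary values, and the state encoding (None for "not a member", an integer for the reference count) are all assumptions made for the example.

```python
class TransposedStore:
    """Toy transposed store: one vector (a Python list) per attribute."""

    def __init__(self):
        self.vectors = {}  # attribute name -> vector of values

    def put(self, offset, values):
        """Store each attribute value in its own vector at logical offset `offset`."""
        for attr, value in values.items():
            vec = self.vectors.setdefault(attr, [])
            vec.extend([None] * (offset + 1 - len(vec)))  # pad with unused places
            vec[offset] = value

    def get(self, offset, attrs):
        """Read the same logical offset in each requested vector."""
        result = {}
        for attr in attrs:
            vec = self.vectors.get(attr, [])
            result[attr] = vec[offset] if offset < len(vec) else None
        return result


PERSON = ["name", "birthdate", "jobs", "spouse", "state_Person"]
EMPLOYEE = ["emp_no", "salary", "state_Employee"]

store = TransposedStore()
# Paul (offset 0) is an employee, hence also a person.
store.put(0, {"name": "Paul", "birthdate": "1960-05-17", "jobs": ["engineer"],
              "spouse": 1, "state_Person": 1,
              "emp_no": 42, "salary": 5000, "state_Employee": 0})
# Pauline (offset 1) is only a person; her places in the Employee vectors stay unused.
store.put(1, {"name": "Pauline", "birthdate": "1962-11-03", "jobs": ["musician"],
              "spouse": 0, "state_Person": 1})

print(store.vectors["spouse"])   # instance identifiers only, no class identifier
print(store.get(1, EMPLOYEE))
```

Reading an object means visiting one vector per requested attribute at the same offset, which is why partial reads are cheap and full reads touch many vectors.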

Fig. 2. Storage format of an employee object

In our example, if one queries the database system to get all the persons - with all
their attribute values - whose name begins with ‘P’, the system searches the name vector
and gets a list of instance identifiers (r_1, r_2, ..., r_p) which correspond to the logical offsets
where a name beginning with ‘P’ was found. Then the system finds all the attributes
of class Person by querying the dictionary. Finally it gets the value of each object
<R_Person, r_i> (1 ≤ i ≤ p) on each attribute of Person (it looks at position r_i in the
corresponding vectors).
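
The three steps of this query can be sketched as follows. The vectors, the dictionary structure and the third person (Anna) are hypothetical data made up for the example; only the order of the steps follows the text.

```python
# One vector per attribute of Person; position r holds the value of instance r.
vectors = {
    "name":      ["Paul", "Pauline", "Anna"],
    "birthdate": ["1960-05-17", "1962-11-03", "1971-02-09"],
    "jobs":      [["engineer"], ["musician"], ["teacher"]],
    "spouse":    [1, 0, None],
}
# The dictionary (schema catalogue) lists the attributes of each class.
dictionary = {"Person": ["name", "birthdate", "jobs", "spouse"]}


def select_persons(prefix):
    # Step 1: scan only the name vector and collect the matching logical offsets.
    offsets = [r for r, name in enumerate(vectors["name"])
               if name is not None and name.startswith(prefix)]
    # Step 2: ask the dictionary which attributes class Person has.
    attrs = dictionary["Person"]
    # Step 3: read position r in every attribute vector.
    return [(r, {a: vectors[a][r] for a in attrs}) for r in offsets]


for r, person in select_persons("P"):
    print(r, person["name"], person["spouse"])
```

Only the name vector is scanned in full; the other vectors are read at the selected offsets.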

A disk block contains the values of only one vector. If the block size is smaller than
the vector size, the vector is stored in several blocks that are not necessarily contiguous.
Moreover, blocks corresponding to different attributes of the same class need not be
contiguous. A map in the database files indicates which blocks belong to each attribute.

The transposed storage can be seen as an implementation of the binary model. 



2.2 Advantages and Disadvantages 

The transposed storage is advantageous for access operations (queries, navigations) that
need to access only a small part of an object's value. For any operation, only the blocks of
the involved attributes are loaded into main memory, instead of blocks of whole objects as
with the classical storage. This reduces the number of I/O operations and the I/O time
(see the results of the OO7 benchmark in table 4). Experience with the Vision system
[17] has shown that most queries access a small proportion of the object attributes.

The transposed storage is also advantageous when retrieving the values on the same
attribute of many objects (aggregate functions such as sum, average, etc.).

The transposed storage can be disadvantageous when retrieving many (n) attribute
values of an object, since it requires loading n blocks instead of one. The same holds when
creating/deleting an object with n attribute values. This explains why the transposed
storage is not widely used. Nevertheless, we think that loading n blocks could be done
in parallel, and this issue deserves to be investigated. Moreover, the arguments against
the transposed storage do not take schema changes into account.

When using the transposed storage there is a storage overhead but this allows easy 
object migration. In our example, the (r+1) places of the Employee vectors are unused. 
But if Pauline becomes an employee object, its values on the Employee attributes will 
be stored there. 

3 Advantages of Transposed Storage for Schema Evolution 

3.1 Fewer I/O Operations

One major variable for calculating I/O time is the number of objects that can be stored
in a disk block, known as the blocking factor (bf) [13]: bf = ⌊disk block size / object
size⌋. In the transposed storage, a disk block contains values of only one attribute att.
The number of values that can be stored in a disk block is: bf_att = ⌊disk block size / att’s
value size⌋. Since the size of a value (set-value if the attribute is multi-valued) is much
smaller than the size of an object, the blocking factor is much greater.

A schema change may update a great number of database objects without needing
to access their whole value (i.e. on all attributes). For example, changing the domain of
an attribute att of class C requires scanning the att values taken by the objects of C and
possibly changing them. Thanks to the transposed storage, whole objects are not brought
into main memory as with the classical storage. Only the blocks of the attributes involved
in the schema change are loaded (in F2: att to update the att values, state_oldDomain and
state_newDomain to update the reference counters). This considerably reduces the number
of I/O operations from disk to main memory and vice versa, and hence the I/O operation
time (see the results of the benchmark for schema evolution in tables 1 and 2).
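
The effect of the blocking factor on a scan can be estimated with a short calculation. The block size and the number of objects come from the paper (1 KB blocks, 100,000 atomic parts in the medium database); the object size of 200 bytes and the value size of 4 bytes are assumptions chosen only to make the arithmetic concrete.

```python
def blocking_factor(block_size, item_size):
    """Number of items (objects or attribute values) that fit in one disk block."""
    return block_size // item_size


def blocks_to_scan(n_items, block_size, item_size):
    """Blocks read when scanning n_items stored with the given item size."""
    bf = blocking_factor(block_size, item_size)
    return -(-n_items // bf)  # ceiling division


BLOCK = 1024        # 1 KB blocks, as in the testbed of section 4.2
OBJECT_SIZE = 200   # assumed size of a whole object record, in bytes
VALUE_SIZE = 4      # assumed size of one attribute value, in bytes
N = 100_000         # atomic parts in the medium database

classical = blocks_to_scan(N, BLOCK, OBJECT_SIZE)
transposed = blocks_to_scan(N, BLOCK, VALUE_SIZE)
print(classical, transposed)
```

With these assumed sizes, scanning whole objects reads 20,000 blocks, whereas scanning one vector reads 391. Even a schema change that touches three vectors (att and the two state attributes) stays at 1,173 blocks.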



3.2 No Database Reorganization 

Thanks to the transposed storage, the database is not reorganized for all the supported 
schema changes in the F2 DBMS. This is a very important result since database reor- 
ganization is extremely time consuming. We give hereafter some examples. 

• Create an attribute. When adding an attribute to class C, new blocks are reserved to 
store the attribute’s values. The objects of class C are left untouched. In the classical 
storage, all existing objects of C will have a field added to their record. Gemstone for 
example rewrites the objects of C [15]. 

• Delete an attribute. When deleting an attribute att of class C, the blocks occupied 
by the att values are released and marked free in the database map. Here again, the phys- 
ical storage is not reorganized. In the classical storage, the objects of C will have a field 
removed from their record. Gemstone rewrites the objects of C while Orion screens 
their att values. 

• Update the origin class of an attribute. Changing the origin class of an attribute (inside
a user-defined class hierarchy) is not equivalent to dropping the attribute and adding
it to a new class, because in that case the values on the attribute would be lost. Very few
systems support it: Cocoon [19], Goose [14] and F2. In the transposed storage, this
schema change does not reorganize the database, because the attribute values of an object
are not stored together. Moreover, the values of the modified attribute are not copied,
because objects belonging to several classes in the same class hierarchy have their
attribute values at the same logical offset in vectors. In our example of §2.1 (fig. 2), if the
birthdate attribute is moved from Person to Employee, Paul (employee object) keeps its
birthdate value while Pauline (person object) loses it (the value is erased in the vector).
If the origin class of the salary attribute is moved from Employee to Person, Paul keeps
its salary value while Pauline gets the nil value.

• Create a subclass. Adding a subclass may require migrating objects between classes.
For example, if one adds Musician as a subclass of Person, some of the Person objects
should now belong to the Musician class. Cocoon, O2, and F2 (thanks to multiobjects
[1]) allow object migration when adding a subclass. In the transposed storage, migrating
an object does not require copying it. Only its values on the involved state attributes are
updated. In our example (fig. 2), Pauline (person object) becomes an object of Musician:
its value on the state_Musician attribute is set to 0 and its other attribute values are
left untouched.

Thanks to the transposed storage, the database is not reorganized when updating 
the schema. Consequently the execution time of schema changes is considerably re- 
duced (see the results of the benchmark for schema evolution in tables 1 and 2). 
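
The examples of this section can be summarized in a small sketch showing that each schema change touches only the vectors it concerns. The class TransposedDB, its method names and the instrument attribute are hypothetical; in particular, using None for "not a member" and 0 for "member with no references" is an assumption of the sketch.

```python
class TransposedDB:
    """Toy model: each attribute is a separate vector indexed by logical offset."""

    def __init__(self, n_objects):
        self.n = n_objects
        self.vectors = {}

    def create_attribute(self, attr):
        # Only a new vector is allocated; existing vectors are not rewritten.
        self.vectors[attr] = [None] * self.n

    def delete_attribute(self, attr):
        # The vector is released; no other vector is touched.
        del self.vectors[attr]

    def migrate(self, offset, state_attr):
        # The object joins a class: only its value on the state attribute changes.
        self.vectors[state_attr][offset] = 0


db = TransposedDB(n_objects=2)
db.create_attribute("name")
db.vectors["name"][:] = ["Paul", "Pauline"]
db.create_attribute("birthdate")
name_vector = db.vectors["name"]          # keep a handle to check it is never replaced

db.create_attribute("state_Musician")     # create a subclass: a new state vector
db.migrate(1, "state_Musician")           # Pauline becomes a musician
db.create_attribute("instrument")         # hypothetical new attribute
db.delete_attribute("birthdate")

print(db.vectors["name"] is name_vector, db.vectors["state_Musician"])
```

In a record-based layout each of these changes would require rewriting every record of the class. Here the name vector is the same list object before and after all four changes.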



4 Benchmark for Schema Evolution 

The transposed storage architecture is implemented in the F2 DBMS. F2 is a general
purpose database system that has been developed at CUI since 1989 and has been used
to experiment with several features such as updatable views, information system design
methods, knowledge databases, database integration, and schema evolution [4] [3] [2]
[1]. It is written in Ada and runs under SunOS, DEC/ALPHA, MacOS and Windows 95.
To test our approach, we extend the OO7 benchmark [9] [10] to take into account
schema evolution operations and submit F2 to it. We chose the OO7 benchmark
because it is the current industry standard OO benchmark. F2 supports schema evolution
and propagates schema changes immediately and physically.

4.1 The OO7 Database

The OO7 database [9] [10] is composed of a design library and an assembly hierarchy.
The design library is a set of composite parts. Associated with each composite part is a
document. Each composite part has an associated graph of atomic parts. A module
represents an assembly hierarchy of seven levels. The first level of the assembly hierarchy
consists of base assemblies while the higher levels are made up of complex assemblies.
Each base assembly is associated with three composite parts. Each module has an
associated manual. The schema of the OO7 database is illustrated in figure 3.

Fig. 3. The OO7 schema in F2

Number of objects in: Module: 1, Manual: 1, Complex A.: 364, Base A.: 729,
Composite P.: 500, Document: 500, Atomic P.: 10,000 (small-3) or 100,000 (medium-6),
Connection: 30,000 (small-3) or 600,000 (medium-6).



4.2 Testbed Configuration 

We take the same testbed configuration described in [9] to run the benchmark on the F2
system (Server: Sun IPX with 48 MB main memory. Client: Sun Sparc ELC with 24 MB
main memory, both running SunOS 4.1.2). Like Objectivity, F2 has no server process
and its client process accesses the database blocks via NFS. Its client buffer pool is set
to 12,000 blocks of 1 KB (total: 12 MB). The queries in F2 are written in the Ada
language. We take the same “small-3” and “medium-6” single-user OO7 databases (we
generated the databases according to the code in [10]). Their size is given in table 3.

4.3 Schema Evolution Operations 

To our knowledge, there is no benchmark dedicated to testing the performance of a
database system with respect to schema evolution. We propose to extend the OO7
benchmark with schema evolution operations. We add two new categories of operations:
changes on the class hierarchy (create/delete a class, change its superclasses) and
changes on the class content (attributes). An operation may be composed of several
schema and database changes. For each operation, we indicate how the schema and data
should be modified independently of any database system. Each operation is performed
on the initial state of the database unless we specify another state. In the following
tables, all times are in seconds and they are cold times.¹ ²

¹ Cold times do not include commit times, as in [10]. Warm times for a schema change do not
make sense.

Operations on the Class Hierarchy.

• ChangeClass1: test class creation.

Create the class Language (uniqueness of class names is checked).

• ChangeClass2: test class creation with object migration.

Create the class OldAtomicPart as a subclass of AtomicPart (uniqueness of class names
is checked) and move the objects of AtomicPart whose builddate is before 1500 to
OldAtomicPart.

• ChangeClass3: test class deletion with objects and attributes deletion.

Delete the class Document and its objects. The attributes of Document (title, id, text,
textLen, part) and the attribute CompositePart.documentation³ are also deleted.

• ChangeClass4: test class deletion with object migration.

We suppose that the class OldAtomicPart has already been created (see ChangeClass2).
This schema operation moves the objects of class OldAtomicPart to class AtomicPart
and then deletes OldAtomicPart.

• ChangeClass5A: test class deletion with its descendants.

We suppose that the attributes ComplexAssembly.subAssemblies, Assembly.superAssembly,
Module.designRoot, Module.allBases, BaseAssembly.components and
CompositePart.usedIn have been deleted. This schema operation moves the objects of classes
Assembly, ComplexAssembly and BaseAssembly to class DesignObj and then deletes
Assembly and its subclasses ComplexAssembly and BaseAssembly.

• ChangeClass5B: test class deletion and update of its subclasses’ superclass (to be
generalized).

We make the same assumption as in ChangeClass5A. This schema operation updates the
superclass of BaseAssembly and ComplexAssembly from Assembly to DesignObj. Then
it moves the objects of class Assembly to class DesignObj and deletes Assembly.

• ChangeClass6: test class’ superclass update (to be specialized).

Create the class Part as a subclass of DesignObj and update the superclass of
CompositePart and AtomicPart from DesignObj to Part.



Table 1. Times of schema evolution operations in [s]

         ChangeClass1  ChangeClass2  ChangeClass3  ChangeClass4  ChangeClass5A  ChangeClass5B  ChangeClass6
small    0.4           27.8          8.3           [illegible]   11.0           98.6           [illegible]
medium   0.4           286.4         49.0          230.4         11.1           850.9          560.4




Operations on the Class’ Content.

• ChangeAtt1: test attribute creation.

We suppose that the class Language has already been created (see ChangeClass1). Add
the attribute language (of domain Language) to the class Document.

• ChangeAtt2: test attribute deletion from a class.

Delete the attribute docid of the class AtomicPart.

• ChangeAtt3: test attribute deletion from a class and its descendants.

Delete the attribute type from the class DesignObj and all the classes which inherit it
(Module, Assembly, ComplexAssembly, BaseAssembly, CompositePart, AtomicPart).

• ChangeAtt4: test attribute’s domain update (cardinality).

Update the domain of the attribute CompositePart.documentation from Document to
list(Document). The existing values on this attribute should not be lost.

• ChangeAtt5: test attribute’s domain update (to be generalized).

Update the domain class of the attribute CompositePart.usedIn from BaseAssembly to
Assembly. The values taken on this attribute should not be lost.

• ChangeAtt6: test attribute’s domain update (to be specialized).

Update the domain class of the attribute ComplexAssembly.subAssemblies from
Assembly to BaseAssembly. The valid values (objects belonging to BaseAssembly) taken on
this attribute should not be lost, while the others should be replaced by nil.

• ChangeAtt7: test attribute’s origin update (to be generalized).

Update the origin class of the attribute BaseAssembly.components to Assembly. The
base assemblies should keep their value on this attribute, and the other assemblies
should take the nil value.

• ChangeAtt8: test attribute’s origin update (to be specialized).

Update the origin class of the attribute DesignObj.type to CompositePart. This attribute
is no longer inherited by the classes Module, Assembly, ComplexAssembly, BaseAssembly
and AtomicPart. The composite parts should keep their value on this attribute.

² We did not take into account the time to delete an index when deleting a class or an attribute
(ChangeClass3, ChangeClass5A). The index is set as unavailable and a purge function can be
called later (for example before leaving the DBMS) to delete all the unavailable indexes.

³ The notation C.att is used for the attribute att of class C.



Table 2. Times of schema evolution operations in [s]

         ChangeAtt1   ChangeAtt2   ChangeAtt3   ChangeAtt4   ChangeAtt5   ChangeAtt6  ChangeAtt7  ChangeAtt8
small    [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  0.8         0.3         7.0
medium   0.4          0.2          0.2          21.5         7.3          0.8         0.3         73.4




Comments. 

- The results of this benchmark show that the architecture of the F2 DBMS, based on 
the transposed storage, is realistic and efficient for schema evolution operations. The 
transposed storage avoids database reorganization (§3.2) and reduces the number of 
I/O operations (§3.1) when modifying the schema. This explains the short times of 
F2. 

- The most expensive schema changes are those which update the superclass of a subclass
(ChangeClass5B, ChangeClass6). Since F2 supports specialization constraints
and automatic classification [1], it checks whether objects should be reclassified
when the superclass of a subclass is updated.

- The execution times of the operations on class content are very short. The time of 
ChangeAttS is due to reference counters update. 

- It would be interesting to run this benchmark on other OODBMS supporting schema
evolution and to compare the results, but we cannot do so because “it is actually a
violation of the system’s licence agreement, and therefore a very real potential
source of a law suit, to purchase an OODBMS, run a benchmark, and publish the
results.” [8].

4.4 OO7 Operations

We also run the OO7 operations on F2 to show that F2 performs well even when schema
evolution is not involved. The OO7 benchmark operations are divided into three groups:
queries, traversals and object modifications. The results presented here are for the
“small-3” and “medium-6” single-user OO7 benchmark databases.⁴ They include the
results of the Exodus, Ontos, Objectivity and Versant database systems, which are taken
directly from [9]. Our goal is not to claim that the F2 database system, which is not
optimized, is better than other systems, but to show that our approach is realistic and
that it is interesting for operations on both schema objects and database objects. In
the following tables, all times are in seconds and they are cold times.⁵ A ‘*’ indicates
that the result is not provided.

The size of the databases (in Megabytes) in each of the systems is given in table 3.



Table 3. Size of the OO7 databases in [MB]

DB size   Exod    Ontos   Objy   Versnt   F2
small     11.5    4.2     5.7    *        6.4
medium    125.6   122.3   74.9   *        52.8




The queries include exact-match queries (Q1), range queries (Q2, Q3), sequential
scan queries (Q7), pointer-based joins (Q4, Q5) and value-based joins (Q8). The traversals
include dense traversals (T1), dense traversals with updates (T2A, T3A), sparse
traversals (T6), and access to very large objects (T8, T9). The object modifications include
object creation (I) and object deletion (D). A detailed description of these operations
can be found in [9] (see appendix). The results of the OO7 benchmark operations are
illustrated in figure 4 (and given in the appendix).

Comments.

- We were surprised by the differences in database size from one DBMS to another.
In F2, we did not use locks since the benchmark is for single-user databases.
This may explain why our medium database is so small.

- The results of the OO7 benchmark clearly show that the architecture of the F2 system,
based on the transposed storage, is realistic and efficient for operations on
database objects too.

- Queries Q1, Q2, Q3, Q7 access a small proportion of objects (one or two attributes).
Similarly, navigational queries Q4 and Q5 access four and three attributes respectively.
This explains the short times of F2, as said in §2.2. For traversals T1, T2A and
T3A, F2 has better results in the medium database. Traversals T8, T9 access one attribute.

- The worst result of F2 is for the insert operation. This can be explained by two facts:
the transposed storage requires loading n blocks to insert an object with n attribute
values, and F2 performs automatic classification when creating an object, which is
not useful in the OO7 benchmark.

- For the delete operation we did not write any code (as was needed in the other tested
systems) because the F2 system enforces referential and existential dependencies.
This time saving should be taken into account.

⁴ Since the results for the large databases are not provided in [9], we did not implement them.
⁵ Cold times do not include commit times, as in [10].

Fig. 4. Comparison of OO7 results




5 Related Work 

The transposed storage is supported in few DBMS: Rapid [6], Vision [17] and Monet
[7]. All these systems argue for transposed storage because it reduces the transfer of
unneeded data. They do not mention its advantages with respect to schema evolution.

Several OODBMS support schema evolution. We can classify them into three categories:
schema evolution without versioning (Orion [5], Gemstone [15], Cocoon [19],
Goose [14], O2 [12]), schema evolution with versioning (Encore [18], Orion, Goose,
Closql), and schema evolution with views. The approach of F2 described in this paper is in
the first category. DBMS of this category differ in the set of supported schema changes,
the semantics of schema changes and the propagation of schema changes. None of
them supports the transposed storage.



6 Conclusion 

The F2 system has a non-classical transposed storage architecture. The transposed storage
is interesting for schema evolution, since it avoids reorganizing the database and
reduces I/O operations. Thus schema changes are no longer costly operations and the
issues of immediate versus deferred and physical versus logical propagation become
obsolete.

To validate our approach, we proposed to extend the OO7 benchmark to include
schema evolution operations and ran these tests on F2. The results obtained clearly
show that the F2 system performs efficiently for schema evolution operations.
According to the results of the OO7 benchmark operations, we can claim that the F2
system has realistic performance for object operations too. The times of object creation
and deletion could be improved by using a parallel algorithm. This issue deserves to be
investigated.

[Appendix table: results of the OO7 operations. Column order: Exodus, Ontos, Objy,
Versant, F2. Values that did not survive extraction are marked [illegible].]

Q1   generate 10 random ids; for each id generated, look up the atomic part with that id.
     small    [illegible]  [illegible]  [illegible]  1.98         0.5
     medium   0.8          9.7          9.6          2.22         1.1
     (legible fragment of the small row, column unknown: 2.3)

Q2   choose a range for dates (containing the last 1% of the dates found in the atomic parts).
     Retrieve the atomic parts that satisfy this range predicate and read their id.
     small    [illegible]
     medium   18.0         [illegible]  [illegible]  [illegible]  [illegible]

Q3   same as Q2, but the range contains the last 10% of the dates.
     small    [illegible]
     medium   34.7         [illegible]  [illegible]  [illegible]  [illegible]

Q7   scan all atomic parts and read their id.
     small    [illegible]
     medium   [illegible]

Q4   generate 10 random document titles. For each title generated, find the composite part P
     corresponding to the document. Then find all base assemblies that use P and read their id.
     small    [illegible]  [illegible]  [illegible]  2.50         1.1
     medium   1.6          29.9         11.0         2.66         1.3

Q5   find all base assemblies that use a composite part with a build date later than the build
     date of the base assembly and read their id.
     small    [illegible]
     medium   14.9         [illegible]  [illegible]  [illegible]  [illegible]
     (further legible values, column unknown: 16.6, 22.7, 36.2)

Q8   find all pairs of documents and atomic parts where the document id (docId attribute) in the
     atomic part matches the id of the document (id attribute). For each pair, read also the id of
     the atomic part.
     small    [illegible]
     medium   64.7         101.9        227.3        183.11       189.0

T1   traverse the assembly hierarchy (from the root, in depth-first manner). As each base assembly
     is reached, visit each of its referenced composite parts. As each composite part is visited,
     perform a depth-first search on its atomic graph (read the id of atomic parts).
     small    [illegible]  [illegible]  [illegible]  245.08       [illegible]
     medium   965.4        2516.4       1269.4       4941.82      667.2

T2A  repeat traversal T1, but in addition update objects during the traversal: swap the (x,y)
     attributes of the root atomic part in each composite part.
     small    [illegible]  [illegible]  [illegible]  251.58       [illegible]
     medium   1000.8       2102.2       1468.8       5452.66      683.7

T3A  repeat traversal T2A, except that now the update is on the date field, which is indexed. The
     date is incremented if it is odd and decremented if it is even.
     small    [illegible]  [illegible]  [illegible]  251.38       [illegible]
     medium   1083.9       6467.3       1389.7       5464.39      774.5

T6   traverse the assembly hierarchy (from the root, in depth-first manner). As each base assembly
     is reached, visit each of its referenced composite parts. As each composite part is visited,
     visit the root atomic part (read its id).
     small    [illegible]
     medium   29.7         51.2         46.3         37.84        17.8

T8   scan the manual object (text attribute), counting the number of occurrences of the
     character “1”.
     small    [illegible]
     medium   12.3         [illegible]  [illegible]  [illegible]  [illegible]

T9   check whether the first and last characters in the manual object (text attribute) are the same.
     small    [illegible]
     medium   0.2          4.8          11.0         2.54         [illegible]

I    create five new composite parts, which includes creating new atomic parts (100 in small,
     1000 in medium), connections (300 in small, 6000 in medium) and documents (5 in both).
     Insert these composite parts in the database by installing references to them in five
     chosen base assemblies.
     small    [illegible]
     medium   [illegible]
     (legible fragments, column unknown: 25.93, 205.1)

D    delete the five newly created composite parts (and all of their associated atomic parts
     and documents).
     small    [illegible]
     medium   [illegible]



Abstract

During the process of updating a database, two interrelated problems could
arise. On one hand, when an update is applied to the database, integrity
constraints could become violated, thus falsifying database consistency. In
this case, the integrity constraint maintenance approach tries to obtain
additional updates to be applied to re-establish database consistency. On the
other hand, when an update request consists of updating some derived
predicate, a view updating mechanism must be applied to translate the
update request into correct updates on the underlying base facts.

In this paper, we propose a general framework to compare and classify
current methods in the field of view updating and integrity constraint
maintenance. We classify them according to how they tackle
both problems, and we also state the main drawbacks of these methods.



1. Introduction 

Most databases, like relational or deductive ones, allow the definition of intentional
information like views or integrity constraints. Intentional information is defined by
means of rules that allow new data (i.e. intentional data) to be deduced from the data
explicitly stored in the database (i.e. extensional data, like tuples in a relational
database or base facts in a deductive one). Therefore, databases must include a query
and an update processing system able to deal with this kind of information. This paper
addresses some of the problems encountered during update processing and summarizes
previous research in this area.

Views and integrity constraints are the most traditional types of intentional
information. Views are defined by means of deductive rules that allow new
facts (view or derived facts) to be defined from stored (base) facts, while integrity
constraints state conditions to be satisfied by each state of the database.

Databases are updated through the application of given transactions that consist of
a set of updates of base facts. When applying a transaction, database consistency may
be falsified, i.e. some integrity constraint may be violated. Therefore, databases must
incorporate some mechanism to ensure that integrity constraints are always satisfied
after the application of a transaction. This problem is usually known as integrity
constraint enforcement. There are several approaches to resolving this conflict
[Win90]. All of them are reasonable and the correct approach to be considered depends
on the semantics of the integrity constraints and of the database. The best known
approaches are integrity constraint checking and integrity constraint maintenance.

Integrity constraint checking is the most conservative approach to deal with
integrity constraints since it rejects the transactions that, if applied, would violate
some integrity constraint. An important drawback of this approach is that the user
may be completely lost regarding possible changes to be made to the transaction to
make it obey the integrity constraints.

An alternative approach, aimed at overcoming this limitation, is that of integrity 
constraint maintenance, which is concerned with trying to identify additional updates 
(i.e. repairs) to be added to the original transaction to guarantee that the resulting 
transaction does not violate any integrity constraint. 

Views provide several advantages like simplifying the user interface or favoring 
logical data independence. However, these advantages can only be achieved if a user 
does not distinguish a derived fact from a base one. Therefore, an update processing 
system must also provide the ability to deal with updates of derived facts. However,
since the view extension is completely defined by the application of deductive rules to 
the contents of the database, changes requested on a view must always be translated 
into changes of the stored base facts. 

The problem of translating an update of a set of derived facts into
appropriate updates of the underlying base facts is known as view updating. In
general, several translations that satisfy the required update exist. Each translation 
defines a possible transaction that, if applied to the current database, would satisfy the 
requested update. 

View updating and integrity constraint enforcement are strongly related. On one 
hand, a translation obtained by view updating could violate some integrity constraint. 
On the other, view updating must be performed when considering a repair of an 
integrity constraint through an update of a derived fact. Moreover, as shown in 
[TO95], view updating and integrity constraint checking can be successfully performed
as two separate steps, while view updating and integrity constraint maintenance 
cannot. 

This paper reviews previous research in the field of view updating and integrity 
constraint maintenance. Its goal is to identify the relevant features to be taken into 
account when solving these problems and to summarize the achievements of previous
methods according to those features. A survey of the early methods in the area of
integrity constraint maintenance is provided in [FP93]. A comparison of some
techniques for view updating and integrity maintenance is given in [TO95].

This paper is structured as follows. Section 2 discusses the relevant features to
be taken into account during view updating and integrity constraint maintenance and
provides a general framework for classifying and comparing relevant work according
to these features. Section 3 describes and analyzes some of the best known methods
and states their individual limitations. Finally, Section 4 summarizes the main
conclusions and points out aspects for further research. 



2. View Updating and Integrity Constraints Enforcement 

The main aspects that must be taken into account during the process of view 
updating and integrity constraint enforcement are the following: the problem 
addressed, the database schema considered, the update requests allowed, the
technique used and the solutions obtained. The following figure illustrates these aspects:

Fig. 1. Relevant aspects of view updating and integrity enforcement



Given an update request, a database schema (i.e. a set of deductive rules and/or
integrity constraints) and a database extension (i.e. a set of base facts), the methods
proposed in this area attempt to provide possible solutions to the concrete
problem addressed by the method. Each particular method will use a certain technique
to obtain the solutions to the problem it addresses. These five aspects provide the 
basic dimensions to be taken into account to summarize previous research in the field 
of view updating and integrity constraint maintenance. 

In the rest of this section, we explain each dimension in detail and present its
relevant features. These features are then used to compare the methods that deal with
these problems. Ordered according to the date of publication, the methods considered
in this survey are [GL90, KM90, ML91, MT93, Wut93, CFPT94, Ger94, CHM95,
CST95, TO95, Sch96, Dec97, LT97, Maa98, Sch98]. Results of each method
according to those features are summarized in Table 2. 



2.1 Problem Addressed 

Not all methods cover both view updating and integrity constraint maintenance. 
Moreover, not all methods for view updating also incorporate an integrity
maintenance policy. Therefore, we can classify these methods according to the 
following features: 

- Whether they are able to deal with view updating or not (indicated by Yes or
No in the second column of Table 2).

- Whether they incorporate an integrity constraint checking or an integrity
constraint maintenance approach (indicated by check or maintain in the third
column).

The methods [KM90, MT93, Wut93, CST95, TO95, Dec97, LT97] cover both
view updating and integrity constraint maintenance, while [ML91, CFPT94, Ger94, 
Sch96, Maa98, Sch98] deal only with integrity constraint maintenance and, thus, are 
not able to handle view updates. Moreover, [GL90, CHM95] deal with view updating 
but apply an integrity constraint checking approach, although [CHM95] is also able 
to propose additional repairs for certain specific integrity constraints. 

Some of the methods perform all the work required to obtain the solutions at run- 
time, i.e. when the real update request and the contents of the database are known; 
while others perform some preparatory work at compile-time, i.e. when only the 
database schema and a parameterized update request are known. Work at compile-time
is intended to generate a program that, when executed at run-time with actual values
given to the formal parameters, will provide the desired solutions. Hence, we have a
third feature regarding the problem dimension: 

- Whether the method follows a run-time or a compile-time approach (indicated
by Run or Compile in the fourth column).

Of the methods considered, only [Sch96] follows a pure compile-time approach,
while [GL90, KM90, ML91, Wut93, CST95, Dec97, LT97] are purely run-time.
Methods like [MT93, CFPT94, Ger94, CHM95, TO95, Maa98, Sch98] perform
work both at run-time and at compile-time.

2.2 Database Schema Considered 

This dimension refers to the language used to define views and integrity 
constraints, to the restrictions imposed on the views and integrity constraints handled
by each method and to the kind of integrity constraints considered by each method. In 
total, we have four different features relevant to this dimension: 

- Definition Language: the language mostly used is logic [GL90, KM90, ML91,
MT93, Wut93, CST95, TO95, Sch96, Dec97, LT97, Maa98], although some
methods [CFPT94, Ger94, Sch98] use a relational language and [CHM95] uses an 
object-oriented one. 

- The DB Schema Contains Views: obviously, all methods that deal with view
updating need views to be defined in the database schema. Of the remaining
methods, [ML91, CFPT94] allow views to be defined, although repairs through views are
not directly handled by these methods during integrity constraint maintenance.

- Restrictions Imposed on the Integrity Constraints : some proposals impose 
certain restrictions on the kind of integrity constraints that can be defined and, thus, 
handled by their methods. 

The most typical restriction is that of dealing only with flat integrity constraints. 
An integrity constraint is flat if it is defined only in terms of base and evaluable 
predicates, i.e. if its definition does not contain any view. This is an important 
restriction since, as shown in the following example, not every possible condition can 
be defined as flat integrity constraints. 

Example 2.1: Assume that we want to state a constraint, Ic1, that all people of
labor age must be employed, where employed people are defined as those people who
work in some company. With non-flat integrity constraints, this constraint can be
stated in logic as follows:

Employed(x) ← Works-in(x, y) ∧ Company(y)

Ic1 ← Labor-age(x) ∧ ¬Employed(x)

However, it is not possible to reduce Ic1 to a flat integrity constraint expressing
the same restriction.
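
As an illustration only, the following Python sketch evaluates the non-flat constraint of Example 2.1 over a small database. The facts (ann, bob, carl, acme, ghost_corp) and the function names are invented for the example and do not come from any of the surveyed methods. The constraint is checked through the Employed view rather than directly over base facts.

```python
# Hypothetical extensional database (base facts); all values are invented.
works_in = {("ann", "acme"), ("bob", "ghost_corp")}
company = {"acme"}
labor_age = {"ann", "bob", "carl"}


def employed():
    """View: Employed(x) <- Works-in(x, y) and Company(y)."""
    return {x for (x, y) in works_in if y in company}


def ic1_violations():
    """Ic1 <- Labor-age(x) and not Employed(x); returns the violating x."""
    return labor_age - employed()


if __name__ == "__main__":
    print("Employed:", sorted(employed()))
    print("Ic1 violated by:", sorted(ic1_violations()))
```

Note that bob works somewhere, but not in a known company, so the violation can only be detected by evaluating the view.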

The methods [Ger94, CST95, Sch96, LT97, Maa98, Sch98] can only handle flat 
integrity constraints; [CFPT94, Ger94, CHM95, CST95, Sch96, LT97, Maa98, 
Sch98] impose other restrictions on the integrity constraints they can handle; while
the other methods do not apply any particular restriction. 

- Static vs. Dynamic Integrity Constraints : integrity constraints may be either 
static, and impose restrictions involving only a certain state of the database, or 
dynamic, and impose restrictions like “salaries cannot decrease” involving more than 
one state. All the methods are able to deal with static integrity constraints, while 
[MT93, Ger94, TO95, Maa98] are also able to deal with dynamic integrity constraints
that involve two consecutive states of the database. 

2.3 Update Requests Allowed

Update processing always starts with a user requesting that an update be applied to a given
database. In general, an update request can be a set of updates, i.e. insertions, deletions 
and/or modifications, to be applied to the database. Hence, we have two different 
features related to the update requests allowed: whether a request may be a set and 
whether all three update operators are allowed. 

- Multiple Update Request : an update request is multiple if it contains several 
updates to be applied together to the database. Only [GL90] does not allow multiple 
update requests. 

- Update Operators: traditionally, three different basic update operators are
distinguished: insertion (ι), deletion (δ) and modification (μ). Modification can always
be simulated by a deletion followed by an insertion. However, it is important to
single out those methods that deal with modifications as such, because each operator
has different semantics and because this may lead to different solutions to view
updating, as shown for instance in [MT93].

2.4 Update Processing Mechanism 

This dimension refers to the mechanism used by each method to perform view 
updating and/or integrity constraint maintenance. According to this dimension, three 
different features are relevant: 

- Applied Technique : the techniques applied by these methods can be classified 
according to four different kinds of procedures. A first group, including [Wut93, 
CST95, LT97], is aimed at incorporating in the update request the information 
provided by the integrity constraints and unfolding this expression. Solutions can then 
be drawn from the resulting formula. A second group of methods [GL90, KM90, 
MT93, TO95, Dec97] are based on extending SLDNF resolution to obtain the
solutions. A third group [CFPT94, Ger94, CHM95, Sch98, Maa98] is aimed at 
deriving active rules that, when triggered and executed at run-time, will provide the 
solutions to the problem addressed. Finally, a fourth group [Sch96] is aimed at 
generating at compile-time predefined programs to deal with the actual request at 
execution time. These four approaches are denoted in Table 2 as: unfolding, SLDNF, 
active and predefined programs, respectively. 

- Taking Base Facts into Account : base facts can either be taken into account 
[ML91, MT93, CFPT94, Ger94, CHM95, CST95, TO95, Maa98, Sch98] or not
[GL90, KM90, Wut93, LT97, Dec97] during update processing. 

- User Participation: some methods may require user participation during
update processing [Wut93, CFPT94, Ger94, Sch96, LT97], while others do not
[GL90, KM90, ML91, MT93, TO95, CST95, CHM95, Dec97, Sch98, Maa98]. User
participation may be desirable to choose among several possible alternatives [ML91]
or to actively participate during the generation of the solution [CFPT94, Ger94]. In
this latter case, user participation prevents these methods from being completely automatic.

2.5 Obtained Solutions 

It should be expected that the solutions obtained by each method satisfy the 
problem addressed by that method. However, this is not always the case and it may 
happen that a method obtains a “solution” that is not properly a solution since it
does not satisfy the requested update. Moreover, several solutions that satisfy an
update request may in general exist. Therefore, another issue to consider is whether a 
method is able to obtain all solutions or not. 

- Correctness: a method is correct if it only obtains solutions that satisfy the
requested update. As we said before, this is not always the case and some methods
exist that may obtain non-valid solutions [KM90, CFPT94, Ger94, LT97, Maa98,
Sch98]. We distinguish whether correctness of a certain method is formally proved
[CHM95, CST95, TO95], whether a method is not correct because it may obtain
invalid solutions (examples of these situations will be shown in the next section), and
whether correctness is not proved formally but we do not know of any example
showing that the method is not correct [GL90, ML91, MT93, Wut93, Sch96,
Dec97]. These situations are indicated by Yes, No and Not Proved, respectively.

- Completeness: a method is complete if it is able to obtain all solutions that
satisfy a given update request. Again, completeness can be formally proved [TO95],
there may be examples showing that a certain method is not complete [GL90, KM90,
Wut93, CFPT94, Ger94, CHM95, Sch96, Dec97, Maa98, Sch98] or completeness is
not proved but no contradicting example is known [ML91, MT93, CST95, LT97]. In
Table 2, we indicate these situations by Yes, No and Not Proved, respectively.
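
The two properties above can be read as set inclusions between the solutions a method obtains and the solutions that actually satisfy the request. The sketch below is only an illustration of those definitions; the solution identifiers and the two "methods" are hypothetical.

```python
def is_correct(obtained, valid):
    """Correct: every obtained solution is a valid one."""
    return obtained <= valid


def is_complete(obtained, valid):
    """Complete: every valid solution is obtained."""
    return valid <= obtained


# Hypothetical solution identifiers, used only to exercise the definitions.
valid_solutions = {"s1", "s2"}
method_a = {"s1"}              # obtains only valid solutions, but misses s2
method_b = {"s1", "s2", "t"}   # obtains every valid solution, plus an invalid one

if __name__ == "__main__":
    for name, obtained in (("a", method_a), ("b", method_b)):
        print(name, is_correct(obtained, valid_solutions),
              is_complete(obtained, valid_solutions))
```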

2.6 Summary of Current Methods 

Table 2 summarizes existing methods for view updating and integrity constraint 
maintenance according to the relevant features considered for each dimension.

This table shows that only a few methods [KM90, MT93, Wut93, TO95, Dec97]
consider both view updating and integrity constraint maintenance without imposing 
significant restrictions on the integrity constraints they can handle. However, most of 
these methods present important limitations regarding correctness or completeness 
issues, as we will discuss in the next section. 

The rest of the methods either impose significant restrictions on the constraints 
they can handle [CHM95, CST95, LT97], consider an integrity constraint checking 
approach [GL90] or do not deal with view updating at all [ML91, CFPT94, Ger94, 
Sch96, Maa98, Sch98]. Some of these methods present also some problems regarding 
correctness and completeness. 

3. Detailed Analysis of Relevant Work 

Table 2 summarizes the main aspects that allow classifying relevant work in the 
field of view updating and integrity constraint maintenance. However, several
details are hidden in Table 2, such as examples showing why a certain method
is not correct or complete, or additional restrictions on the integrity constraints.

This section is aimed at providing these examples and at discussing additional 
details of each particular method that are not covered by Table 2. In this section we 
analyze the methods summarized in Table 2, except [GL90, KM90, ML91] since 
these methods were already described in [TO95].

Methods are grouped according to the problem they address. Thus, Section 3.1 
describes methods that deal with integrity constraint maintenance and view updating 
[Wut93, CST95, LT97, CHM95, Dec97], while Section 3.2 comments on methods
that consider only integrity constraint maintenance [CFPT94, Ger94, Sch98, Sch96]. 

Table 2. Summary of view-updating and integrity constraints maintenance methods

Integrity Constraint Maintenance 






3.1 Methods for View-Updating and Integrity Constraint Maintenance 

These methods can be classified in two different groups, according to the technique 
they apply to integrate the treatment of the view update request and of the integrity 
constraints. A first group [Wut93, CST95, LT97] is aimed at incorporating the
information provided by the integrity constraints into the update request and then
unfolding the resulting expression, while the remaining methods [MT93, CHM95, TO95,
Dec97] take the constraints into account every time a new update is considered.
These groups are described, respectively, in Sections 3.1.1 and 3.1.2. 

3.1.1 Methods that Extend the Update Request 

These methods clearly distinguish two steps to obtain the solutions. Given an 
update request U, the first step is aimed at obtaining a formula F, defined only in 
terms of base predicates, that characterizes all solutions of the update request. This 
formula is obtained by incorporating information of integrity constraints in U and by 
unfolding derived predicates by their corresponding definition. In the second step, the 
obtained formula is analyzed to determine the base fact updates of the solutions. 

Wüthrich's Method [Wut93]

This method characterizes an update request as a conjunction of insertions and 
deletions of base and/or derived facts. A solution is then characterized by the set of 
base facts to be inserted (I) and base facts to be deleted (D) that satisfy the requested 
update. The general approach to obtaining the solutions follows the two-step approach
outlined before. This method has two main limitations: it is not complete and it does 
not necessarily generate minimal solutions. 

- There exist solutions that are not obtained by the method: Wüthrich's method
assumes that there is an ordering for dealing with the deductive rules and integrity
constraints involved in the update request, which will lead to the generation of a
solution. However, this ordering does not always exist. The following example shows
that this assumption may prevent this method from obtaining all valid solutions.

Example 3.1: Given the database: 

Node(A)   Node(B)   Edge(A, B)   Edge(B, A)

Ic1 ← Node(x) ∧ ¬∃y Edge(x, y)   Ic2 ← Node(x) ∧ ¬∃z Edge(z, x)

Ic3 ← Edge(x, y) ∧ ¬Node(x)   Ic4 ← Edge(x, y) ∧ ¬Node(y)

and the update request insert(Edge(A,C)), Wüthrich's method could not obtain the
solution characterized by the sets I={Edge(A,C), Node(C), Edge(C,D), Node(D),
Edge(D,B)} and D=∅. This problem will mainly appear when the knowledge base
contains referential integrity constraints, such as the above ones. 
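
The solution of Example 3.1 can be checked mechanically. The following sketch is only a verification aid written for this example, not part of Wüthrich's method; the function name and data layout are assumptions. It reports which of Ic1 to Ic4 a given state violates.

```python
def violations(nodes, edges):
    """Return the names of the constraints of Example 3.1 violated by a state."""
    out = set()
    if any(all(src != x for (src, _) in edges) for x in nodes):
        out.add("Ic1")  # some node has no outgoing edge
    if any(all(dst != x for (_, dst) in edges) for x in nodes):
        out.add("Ic2")  # some node has no incoming edge
    if any(src not in nodes for (src, _) in edges):
        out.add("Ic3")  # some edge starts at a non-node
    if any(dst not in nodes for (_, dst) in edges):
        out.add("Ic4")  # some edge ends at a non-node
    return out


nodes = {"A", "B"}
edges = {("A", "B"), ("B", "A")}

# Applying only the requested insertion Edge(A, C).
request_only = violations(nodes, edges | {("A", "C")})

# Applying the whole solution I (with D empty) from the example.
inserted_nodes = {"C", "D"}
inserted_edges = {("A", "C"), ("C", "D"), ("D", "B")}
repaired = violations(nodes | inserted_nodes, edges | inserted_edges)

if __name__ == "__main__":
    print("request only:", sorted(request_only))
    print("with solution:", sorted(repaired))
```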

- Non-generation of Minimal Solutions*: Wüthrich's method does not
necessarily generate minimal solutions because it does not check whether a base or
view fact is already present in the database when suggesting to insert or to delete it.
This is shown in Example 3.2.

Example 3.2: Given the database:

S(A, B)   P(x) ← Q(x) ∧ R(x)   R(x) ← S(x, y)

Given the request insert(P(A)), Wüthrich's method could only obtain the non-minimal
solution I={Q(A), S(A,C)}, where C is a value given by the user or assigned
by default, although there exists another solution I={Q(A)} which is a subset of the 
previous one. 



* A solution S is minimal if no proper subset of S is also a solution. 
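
The minimality definition in the footnote can be tested by brute force on Example 3.2. The sketch below is an illustration under that definition only; the encoding of facts as (predicate, arguments) pairs and all function names are assumptions made here.

```python
from itertools import combinations

# Base facts of Example 3.2: S(A, B); Q is initially empty.
S0 = {("A", "B")}
Q0 = set()


def is_solution(insertions):
    """Does inserting these base facts make P(A) true?

    P(x) <- Q(x) and R(x);  R(x) <- S(x, y).
    """
    q = Q0 | {args[0] for (pred, args) in insertions if pred == "Q"}
    s = S0 | {args for (pred, args) in insertions if pred == "S"}
    r_holds = any(x == "A" for (x, _) in s)
    return "A" in q and r_holds


def proper_subsets(facts):
    """Yield every proper subset of a set of facts."""
    items = list(facts)
    for size in range(len(items)):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_minimal(insertions):
    """A solution is minimal if no proper subset of it is also a solution."""
    return is_solution(insertions) and not any(
        is_solution(sub) for sub in proper_subsets(insertions)
    )


non_minimal = frozenset({("Q", ("A",)), ("S", ("A", "C"))})
minimal = frozenset({("Q", ("A",))})

if __name__ == "__main__":
    print(is_solution(non_minimal), is_minimal(non_minimal))
    print(is_solution(minimal), is_minimal(minimal))
```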

Console, Sapino and Theseider's Method [CST95]

The two steps of this method proceed as follows. In the first step, a formula F* is
obtained by considering the update request φ and the integrity constraints. In the
second step, F* is instantiated and simplified by considering the contents of the 
extensional database and possible values given by the user. At the end, a ground 
formula in disjunctive normal form F** is obtained characterizing solutions to the 
update request. This method presents one limitation: it only deals with restricted 
integrity constraints. 

- Restrictions on the Integrity Constraints: as shown in Table 2, the method can
only handle flat integrity constraints. Moreover, it also considers two additional
restrictions: constraints must be in denial form and have at most two literals in the
body, or else they must be (non-cyclic) referential integrity constraints.

Lobo and Trajcevski's Method [LT97]

This method obtains, by means of a process of unfolding, a disjunctive normal 
formula F from a given update request. This formula is then extended with residues of 
the integrity constraints potentially violated by F. At the end, non-ground variables 
are instantiated by considering facts of the extensional database. This method presents 
two different drawbacks: 

- Restrictions on the Integrity Constraints: it requires the set of constraints to be
resolution complete. That is, it must not be possible to derive new (implicit)
integrity constraints from the given set of integrity constraints. For instance, Ic1 ← Q(x)
∧ ¬R(x) and Ic2 ← R(x) ∧ S(x) are not resolution complete since a third integrity
constraint can be deduced from them: Ic3 ← Q(x) ∧ S(x). The problem is that, as far
as we know, there is no mechanism to derive sets of integrity constraints that are
resolution complete.

- Invalid Solutions: this method is not always correct since the formula F does 
not always characterize valid solutions, as shown in the following example. 

Example 3.3: Given the database:

S(A, 1)   Q(x, y) ← ¬P ∧ S(x, y)   P ← S(x, y) ∧ ¬T(y)

Given the update request insert(Q(B,2)), this method would obtain two solutions:
T1={delete(S(A,1)), insert(S(B,2))} and T2={insert(T(1)), insert(S(B,2))}. However,
neither of them satisfies the requested update since insert(S(B,2)) induces P and, hence,
falsifies insert(Q(B,2)). Moreover, there exist two solutions, S1={delete(S(A,1)),
insert(S(B,2)), insert(T(2))} and S2={insert(T(1)), insert(S(B,2)), insert(T(2))}, that are
not obtained by this method.
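
The claims of Example 3.3 can be replayed with a small evaluator. This sketch assumes the rules read Q(x, y) ← ¬P ∧ S(x, y) and P ← S(x, y) ∧ ¬T(y), which is the reading under which the four transactions behave as described; the helper names and the transaction encoding are invented for the illustration.

```python
def p_holds(s, t):
    """P <- S(x, y) and not T(y)."""
    return any(y not in t for (_, y) in s)


def q_holds(x, y, s, t):
    """Q(x, y) <- not P and S(x, y)."""
    return not p_holds(s, t) and (x, y) in s


def apply(s, t, transaction):
    """Apply (operation, predicate, arguments) updates to copies of S and T."""
    s, t = set(s), set(t)
    for op, pred, args in transaction:
        target, value = (s, args) if pred == "S" else (t, args[0])
        if op == "insert":
            target.add(value)
        else:
            target.discard(value)
    return s, t


S0, T0 = {("A", 1)}, set()

T1 = [("delete", "S", ("A", 1)), ("insert", "S", ("B", 2))]
T2 = [("insert", "T", (1,)), ("insert", "S", ("B", 2))]
S1 = T1 + [("insert", "T", (2,))]
S2 = T2 + [("insert", "T", (2,))]


def satisfies_request(transaction):
    """Is Q(B, 2) true after applying the transaction to the initial state?"""
    s, t = apply(S0, T0, transaction)
    return q_holds("B", 2, s, t)


if __name__ == "__main__":
    for name, tr in (("T1", T1), ("T2", T2), ("S1", S1), ("S2", S2)):
        print(name, satisfies_request(tr))
```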

3.1.2 Methods that Consider the Integrity Constraints Dynamically 

Chen, Hull and McLeod's Method [CHM95]

This method is based on an execution model aimed at the execution of active rules 
to update derived data in a semantic object-oriented database model. Active rules are 
described by means of Limited Ambiguity Rules (LAR), a kind of condition-action 
rules, that are automatically obtained from the database schema. Two kinds of LAR 
rules are considered: the upward rules propagate changes from base classes to derived 
classes, while the downward ones propagate changes in the opposite direction. 

Given an update request A, LAR rules are executed according to the Principle of 
Down-Up Propagation, which states that all downward rules must always be executed
before upward rules. The main limitation of this method is the following:

- There exist solutions that are not obtained by the method: in some cases this
method cannot obtain all correct solutions, as shown in the following example.






Example 3.4: Given the database schema:

CLASS P derivation: Q and R
CLASS R derivation: S or T

DB = {(S, has-instance, Id1), (R, has-instance, Id1)}

where Q, T and S are base classes; P and R are derived classes; object Id1 belongs to
class S and to class R. Derived class P is defined as the join of Q and R, and class R
is defined as the union of S and T. This method cannot obtain any solution to the
update request of inserting an object Id1 into class P and deleting the same object from
class S: A = {insert(P, has-instance, Id1), delete(S, has-instance, Id1)}. However, there
exists a solution A = {delete(S, has-instance, Id1), insert(Q, has-instance, Id1),
insert(T, has-instance, Id1)}.
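
To see that the solution given in Example 3.4 does satisfy the request, the derived classes can be recomputed after applying it. The sketch below models classes as plain Python sets, reading "Q and R" as intersection and "S or T" as union; this encoding is an assumption made for the illustration and is unrelated to the LAR execution model itself.

```python
def derived_classes(base):
    """Compute the derived classes of Example 3.4 from the base classes."""
    r = base["S"] | base["T"]   # CLASS R derivation: S or T
    p = base["Q"] & r           # CLASS P derivation: Q and R
    return {"R": r, "P": p}


def apply(base, updates):
    """Apply (operation, class, object) updates to a copy of the base classes."""
    new = {name: set(objs) for name, objs in base.items()}
    for op, cls, obj in updates:
        if op == "insert":
            new[cls].add(obj)
        else:
            new[cls].discard(obj)
    return new


base = {"Q": set(), "S": {"Id1"}, "T": set()}
solution = [("delete", "S", "Id1"), ("insert", "Q", "Id1"), ("insert", "T", "Id1")]

after = apply(base, solution)
views = derived_classes(after)
request_satisfied = "Id1" in views["P"] and "Id1" not in after["S"]

if __name__ == "__main__":
    print(views, request_satisfied)
```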

Decker's Method [Dec96, Dec97]

This method is based on the SLDAI resolution procedure, which is an abductive 
extension of the SLD resolution procedure. The SLDAI procedure is an interleaving of 
refutation and consistency derivations. Given an update request, the refutation 
derivation pursues the empty clause by considering the database contents. During this 
derivation, new hypotheses are included in the solution set H. Every time a 
hypothesis is included in H, its consistency is verified by a consistency derivation. 

The main limitation of this method is that it cannot appropriately manage update
requests that involve rules with existential variables. The reason is that the refutations
flounder when a literal corresponding to a non-ground base predicate is selected, thus
preventing the empty clause from being reached.

For example, this method does not obtain any solution to the update request 
insert(P) in a database with only one rule: P ← S(x). Nevertheless, there exist as
many solutions as possible values of x for which to insert S(x). 

Moreover, this method does not take into account the base facts during the
consistency derivations. Therefore, this method may fail to obtain correct solutions because
it flounders. For instance, consider the update request insert(P) in the database:

R(A, B)   P ← Q(A)   Ic1 ← Q(x) ∧ R(x, y) ∧ ¬S(y)

Since this method does not consider base facts in the consistency derivation, it
cannot resolve them and it flounders. Therefore, no solution is obtained even though
there exist two: T1={insert(Q(A)),delete(R(A,B))} and T2={insert(Q(A)),insert(S(B))}.
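
The two solutions of this last example can be verified directly. The sketch assumes the constraint reads Ic1 ← Q(x) ∧ R(x, y) ∧ ¬S(y), which is the reading under which both T1 and T2 repair the violation caused by inserting Q(A); the function names are hypothetical and the code is a check of the final states, not an implementation of SLDAI.

```python
def p_holds(q):
    """P <- Q(A)."""
    return "A" in q


def ic1_violated(q, r, s):
    """Ic1 <- Q(x) and R(x, y) and not S(y)."""
    return any(x in q and y not in s for (x, y) in r)


def is_valid_solution(q, r, s):
    """A final state is acceptable if P holds and Ic1 is not violated."""
    return p_holds(q) and not ic1_violated(q, r, s)


# Initial state: only R(A, B) is stored.
Q0, R0, S0 = set(), {("A", "B")}, set()

naive = is_valid_solution(Q0 | {"A"}, R0, S0)                # insert(Q(A)) alone
t1 = is_valid_solution(Q0 | {"A"}, R0 - {("A", "B")}, S0)    # plus delete(R(A, B))
t2 = is_valid_solution(Q0 | {"A"}, R0, S0 | {"B"})           # plus insert(S(B))

if __name__ == "__main__":
    print(naive, t1, t2)
```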

3.2 Methods for Integrity Constraints Maintenance 

We analyze only those methods that address integrity maintenance alone and that are
based on the generation and execution of active rules [CFPT94, Ger94, Sch98,
Maa98].

3.2.1 Methods based on the Execution of Active Rules 

This approach is aimed at maintaining integrity constraints through the generation,
at compile-time, of a set of active rules that, when executed at run-time as a certain
transaction is applied to the database, guarantee that the integrity constraints remain
satisfied. Active rules are generated by taking into account only the information
provided by the database schema and their action part contains the updates needed to 
repair an integrity constraint violation. Methods that follow this approach are 
[CFPT94, Ger94, Maa98, Sch98]. Although all these methods present different 
particularities regarding the generated rules or the language used to define the 
constraints, they share the same limitations. This is why in this section we only 
explain these common drawbacks, instead of describing each method in detail. 

- The update request is not always satisfied: one of the most common
limitations of these methods is that the obtained solutions may not preserve the effect 
of the requested update. The reason is that they do not take into account the history of 
database updates needed to enforce database consistency and, thus, they cannot know 
whether the requested update is undone by the joint effect of these updates. 

Example 3.5 (adapted from [Sch98]): Assume the following database and the
update request insert(Wire(Id1, HB, A, 2, 0)):

Tube(Id1, HB, 4)   Wire(Id5, HB, A, 2, 0)

Wire(wire_id, conn, w_typ, volt, pwr) → Tube(tube_id, conn, t_typ)

Wire(wire_id, conn, w_typ, volt, pwr) ∧ Tube(tube_id, conn, t_typ) → wire_id ≠ tube_id

Here, [CFPT94, Ger94] could obtain a solution T={insert(Wire(Id1,HB,A,2,0)),
delete(Tube(Id1,HB,4)), delete(Wire(Id1,HB,A,2,0)), delete(Wire(Id5,HB,A,2,0))} that
does not satisfy the original request.

[Sch98] is aimed at generating active rules that do not present this problem.

However, in this example, it is possible to find correct solutions that satisfy the
request and maintain database consistency, such as S={insert(Wire(Id1,HB,A,2,0)),
delete(Tube(Id1,HB,4)), insert(Tube(Id9,HB,9))}, which these methods are not able to obtain.

- Not all valid solutions can be obtained: once the set of active rules for integrity
maintenance is generated, these methods [CFPT94, Ger94] usually define a graph that
expresses whether the execution of a certain rule that repairs an integrity constraint
could violate another integrity constraint. The presence of cycles in this graph
indicates that the process of integrity maintenance might never terminate. To guarantee
termination, the database designer may remove some rules or define priorities
among them. Not considering all active rules [CFPT94, Ger94, Maa98, Sch98] is
equivalent to discarding some potential repairs and, thus, not all solutions can be obtained.
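
The termination analysis described above amounts to cycle detection in a directed graph whose edges say "executing this rule may violate a constraint that the other rule repairs". The following sketch uses a standard depth-first search; the three rule names and their edges are hypothetical, loosely modeled on Example 3.5, and are not the rules generated by any of the cited methods.

```python
def has_cycle(graph):
    """Depth-first search for a cycle in a directed graph given as
    {node: set of successor nodes}."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for succ in graph.get(node, ()):
            state = color.get(succ, WHITE)
            if state == GREY:
                return True          # back edge: a cycle exists
            if state == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Hypothetical repair rules: inserting a tube may break the wire/tube
# identifier constraint, and deleting a tube may leave a wire without a tube.
triggering = {
    "insert_tube": {"delete_tube"},
    "delete_tube": {"insert_tube", "delete_wire"},
    "delete_wire": set(),
}

# Dropping one rule breaks the cycle, at the price of losing that repair.
pruned = {
    "delete_tube": {"delete_wire"},
    "delete_wire": set(),
}

if __name__ == "__main__":
    print(has_cycle(triggering), has_cycle(pruned))
```

In the pruned graph, maintenance is guaranteed to terminate, but any solution that needs the removed insertion rule can no longer be produced, which is the loss of completeness discussed above.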

4. Conclusions and Further Work 

We have proposed a general framework that allows previous research in the field of
view updating and integrity constraint maintenance to be compared. This framework is
based on taking into account the five relevant dimensions that participate in this
process, i.e. the kind of update requests, the database schema considered, the problem 
addressed, the solutions obtained and the technique used to draw these solutions. The 
main methods proposed up to now have been analyzed according to these dimensions. 

From our study, we may conclude that the research in this area is still concerned 
with providing effective methods, i.e. methods able to obtain all valid solutions, 
without imposing strong restrictions on the considered views and integrity 
constraints. This may be the reason why efficiency issues have been neglected in this
area, although particular exceptions can be found [CFPT94, Ger94, FP97, MT97]. 

We believe there is still a huge amount of future research to be pursued in this
area. On the one hand, effective methods are required before these techniques can become
useful in practical applications. Otherwise, if a method fails to obtain a
solution, it is not possible to know whether there is no translation or whether there is
one but the method is not able to find it. On the other hand, efficiency must be carefully
addressed if we want this technology to be of practical use.

Abstract. In this paper we show how temporal databases can be specified
and implemented using the bitemporal event calculus, an extension
of the event calculus that includes both valid and transaction time,
and the possibility to perform temporal updates. A caching mechanism 
that maintains the current historical state and is updated after each 
transaction has also been incorporated. We also consider the problem of 
checking integrity constraints in this kind of temporal databases. The 
methodology for consistency checking presented here is an extension of 
other approaches found in the literature that exploit the assumption 
that the database satisfies its integrity constraints prior to the update 
transaction. A prototype of the formalism and the checking mechanism, 
implemented in PROLOG, has also been developed. 



1 Introduction 

Time often shows up in information systems, as these are, after all, models
of some part of our reality. However, despite a great deal of work in the area
of temporal databases¹, most database management system implementations
do not have built-in time support. The reason for this is the complexity of the
semantics and representation of time. Having a logical specification of a temporal
database and being able to study its evolution through a sequence of transactions
would be useful to shed some light on the topic.

In this paper we briefly show how an extended version of the event calculus
[8], a logic programming formalism for specifying event occurrences and change,
can be used to both specify and implement temporal databases. An extended
account can be found in [10]. This new formalism will allow us to have a logical
specification of a temporal database, with a logic programming semantics and
the possibility of computing directly from the specification. Our specification
has, among others, the following features:

- Handle two (orthogonal) time lines: valid and transaction time.

- Support temporal updates similar to those found in conventional temporal
databases.

- Allow queries involving both time lines.

- Allow corrections to be made to the information contained in the database.

- Provide efficient query answering and integrity constraint checking.

* This research was supported by FONDECYT (Grant #1980945) and
ECOS/CONICYT (Grant C97E05).

** Casilla 360, Santiago 22, Chile, {camareco, bertossi}@ing.puc.cl

¹ We use the terminology of Snodgrass and Ahn [16] regarding time and databases.

The rest of this paper is organized as follows. Section 1.1 briefly introduces
the event calculus. Section 2 summarizes some aspects of the bitemporal event
calculus, BtEC, presented in [10]; in particular, it shows how to add transaction
time and temporal updates to the simplified event calculus. Section 3 explains
how a caching mechanism can improve the efficiency of the underlying
computational mechanisms. Integrity constraint checking in the bitemporal event
calculus is analyzed in Section 4. Finally, Section 5 discusses some related work.



1.1 The Event Calculus 

The event calculus was introduced by Kowalski and Sergot in [8] as a formal- 
ism for reasoning about events and change in a logic programming framework. 
However, after the original paper was published, most of the work on the event 
calculus centered on a simplified version [6,13], called the simplified event cal- 
culus. 

In the simplified event calculus, events initiate and terminate properties of
the real world. Updates are of additive nature only, and they are specified by
adding facts of the form happens_at(Event, Time), where Time is the valid (or
real world) time of occurrence of the event. Events are related to the properties
they initiate/terminate by predicates of the form:

    initiates_at(Event, Property, Time),  terminates_at(Event, Property, Time).

The simplified event calculus is composed of the following general rules:

    holds_at(p, v) ← happens_at(e, v1), v1 < v, initiates_at(e, p, v1),
                     not broken(p, [v1, v]).

    broken(p, [v1, v2]) ← happens_at(e, v), v1 < v < v2, terminates_at(e, p, v).

The holds_at rule is used to ask about the validity of some property at a
particular time point. It states that a property p is valid at time v if there
exists an event e which occurs at a time v1 less than v, e initiates p and the
interval [v1, v] for p is not broken. An interval for a property is broken if
there exists an event which terminates the property and occurs within the
interval. Negation-as-failure is used to express that, by default, the effects
of an event which initiates a property extend until it is interrupted by another
event which terminates the property. Since the simplified event calculus allows
us to deduce the (valid) time periods during which properties hold, it can be
used, as pointed out in [18], to formalize a historical database, albeit without
a facility for correcting errors.
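
The two general rules can be read operationally. The following is a minimal Python sketch of that reading, not the PROLOG formulation used in the paper; the event names, the property string and the fact list are hypothetical and serve only as an illustration.

```python
# Hypothetical fact base: happens_at(Event, Time) as (event, valid_time) pairs.
happens = [("hire_john", 5), ("fire_john", 20)]

def initiates_at(event, prop, time):
    # Assumed domain rule: hiring John initiates employed(john).
    return event == "hire_john" and prop == "employed(john)"

def terminates_at(event, prop, time):
    # Assumed domain rule: firing John terminates employed(john).
    return event == "fire_john" and prop == "employed(john)"

def broken(prop, v1, v2):
    # Some terminating event occurs strictly inside the interval.
    return any(v1 < v < v2 and terminates_at(e, prop, v) for e, v in happens)

def holds_at(prop, v):
    # An initiating event occurs before v and the interval up to v is unbroken.
    return any(
        v1 < v and initiates_at(e, prop, v1) and not broken(prop, v1, v)
        for e, v1 in happens
    )
```

With these facts, `holds_at("employed(john)", 10)` is true and `holds_at("employed(john)", 21)` is false. Because both comparisons are strict, the property does not yet hold at time 5 but still holds at time 20.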
2 The Bitemporal Event Calculus

As each event is added, the database goes through different states (a state is 
a set of properties with their valid time intervals), each state being a historical 
database in itself. However, unlike a temporal database, when an event is added, 
it destroys the previous state of the database, making it impossible to derive the 
valid time intervals which held in previous states. 

In order to access a certain state s, only events which were introduced up to
that state s must be considered by the rules. A transaction time can be associated
with every event, and events are specified as facts:
happens_at(event, valid time, transaction time).

In the simplified event calculus, events specify only one time point, namely, 
the time at which the event occurs in the modeled reality. Therefore, an event 
may only determine the start or end of a property’s valid time interval. 

However, several temporal data models and databases, such as TQuel [14], 
ATSQL2 [20], BCDM [5] and Chronolog [1], provide updates that specify valid 
time intervals rather than time points. This is so because it is quite common 
to refer to time intervals, for example, to say that an employee’s salary will be 
X from September to December. In addition, it is still possible to simulate the 
behavior of the simplified event calculus using intervals, simply by specifying 
that the interval has an infinite upper bound. 

Conversely, it is also possible to specify intervals with time points by using
two events per interval [18]. This, however, quickly becomes complex for
simple and natural cases. For these reasons, we chose to use valid time
intervals instead of time points directly in the events, which are now specified
as happens_at(event, [v1, v2], tx), where [v1, v2] is the valid time interval,
and tx is the time when this occurrence is introduced in the database. The
semantics of this transaction is: the event occurs at valid time point v1 and
its effects extend up to v2. It is not an event with duration; events are still
considered to be instantaneous. The only difference is that their effects no
longer extend to infinity as before, but only up to v2.

The first modification required is to have the rules filter out events according 
to the transaction time specified in the query. Other changes are necessary to 
account for the time intervals in the events. The basic rules of our formalism are 
the following: 

    holds_at(p, vt, tx) ← happens_at(e, [vs, ve], tin), tin < tx, vs < vt < ve,
                          initiates_at(e, p, vs, tx),
                          not broken(p, vt, [tin, tx]).                  (BtEC1)

    broken(p, v, [ts, te]) ← happens_at(e, [vs, ve], tin), ts < tin < te,
                             vs < v < ve, terminates_at(e, p, vs, te).   (BtEC2)

The holds_at rule states that a property p holds at a valid time point vt and
transaction time tx if there exists an event e which initiates the property and
is introduced in the database at a transaction time < tx with a valid time
interval [vs, ve] that contains vt, and the property is not terminated at that
valid time point from the moment (tin) when the event was introduced in the
database until the time (tx) specified in the query. Predicate holds_at is like the
“timeslice” operator offered by some temporal data models.

The broken rule states that a property p is interrupted at a valid time point
v if there is an event which occurs at a transaction time between ts and te, has
a valid time interval containing v and terminates the property. This means that,
unlike the simplified event calculus, where the order (transaction time) of the
events is not considered, if a property is initiated by an event at a transaction
time t, events that occurred at transaction times less than t cannot affect the
property.
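
To make the role of the two time lines concrete, here is a small Python sketch of the holds_at and broken rules with transaction time. It is an illustration under stated assumptions rather than the authors' PROLOG code: the facts are hypothetical, only the domain-independent ins and del events are modeled, and all comparisons are taken to be strict.

```python
INF = float("inf")

# Hypothetical facts happens_at(Event, [vs, ve], tin).
happens = [
    (("ins", "salary(john,40)"), (10, INF), 1),  # entered at transaction time 1
    (("del", "salary(john,40)"), (15, INF), 2),  # correction entered at time 2
]

def initiates_at(event, prop):
    return event == ("ins", prop)

def terminates_at(event, prop):
    return event == ("del", prop)

def broken(prop, v, ts, te):
    # A terminating event entered between ts and te whose valid interval contains v.
    return any(
        ts < tin < te and vs < v < ve and terminates_at(e, prop)
        for e, (vs, ve), tin in happens
    )

def holds_at(prop, vt, tx):
    # An initiating event entered before tx whose valid interval contains vt,
    # and whose effect at vt was not overridden between tin and tx.
    return any(
        tin < tx and vs < vt < ve and initiates_at(e, prop)
        and not broken(prop, vt, tin, tx)
        for e, (vs, ve), tin in happens
    )
```

Querying valid time 20 as of transaction time 1.5 still yields true, because the del event had not been entered yet; the same query as of transaction time 3 yields false. Earlier states therefore remain accessible after a correction.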

There is also a rule that defines the predicate
mholds_for(Prop, ValidInterval, TransTime), which derives the valid time
intervals of the properties for a given transaction time directly from the
event occurrences.²

On the basis of the formalism presented so far it is possible to support four 
kinds of update on a temporal database [14,1]: insertion, deletion, replacement 
and modification. Notice, however, that since a single event can initiate or ter- 
minate several properties, one event can produce effects that are a combination 
of one or more of these types of update. 

Insertion and Deletion: are normally achieved by means of events that initiate
(terminate) a property as in Sect. 1.1. Another way of accomplishing this type
of update, one that is similar to conventional databases, is by defining two
domain independent events, ins and del, that simply initiate and terminate,
respectively, the specified property. They can be specified by
initiates_at(ins(p), p, v, t) and terminates_at(del(p), p, v, t), meaning that
any event ins(p) (del(p)) occurring at valid time v and transaction time t
initiates (terminates) property p. They can be used for error correction. We are
neither deleting nor revising the previous events, but only overriding their
effects on the property.³

Replacement: In a replacement, properties in given intervals are replaced by
other properties in possibly different intervals. A way of specifying a
replacement, e.g. an error correction, is by means of an event, replace, of the
form happens_at(replace(p1, [vs1, ve1], p2), [vs2, ve2], tx), meaning that p1,
in the interval [vs1, ve1], will be replaced by p2, in the interval [vs2, ve2].
This event can be expressed in terms of ins and del events and vice versa.

Modification: similar to the UPDATE statement in SQL. It means updating the
value of a property to another at a given valid time interval. It could be seen as
a particular case of a replacement, where the valid time intervals coincide. Most
modifications are realized by domain-dependent events, but it is also possible
to represent the operation explicitly, using a new event, update, specified by
happens_at(update(p1, p2), [vs, ve], tx).

² Intervals determined by this and the previous rules are open to the left and closed
to the right. The time point to the right may be infinity. Coalescing of intervals is
not done automatically by these rules, but there is another one in the BtEC for this
purpose.

³ These events are ordinary events, at the same object level as the domain dependent
ones, and are properly handled by the general rules of the BtEC presented before;
in particular, they share the same logic programming semantics as the others.
Table 1. Transactions entered in the database

Event                                                      Valid Int.     T.Time
---------------------------------------------------------  -------------  ------
hire('therese', 5873, 'bahnhofstrasse', 'zurich')          [02.01, inf]   1.1
ins(salary(5873, 3630))                                    [02.05, inf]   2.1
hire('franziska', 6542, 'rennweg', 'zurich')               [02.01, inf]   3.1
ins(salary(6542, 3200))                                    [02.01, inf]   4.1
hire('lilian', 3463, 'speedway', 'tucson')                 [02.02, inf]   5.1
ins(salary(3463, 3400))                                    [02.02, inf]   6.1
raise(6542, 3800)                                          [06.01, inf]   7.1
replace(salary(3463, 3400), [02.02, inf],                  [02.02, inf]   8.1
        salary(3463, 3500))
update(employee('franziska', 6542, 'rennweg', 'zurich'),   [08.01, inf]   9.1
       employee('franziska', 6542, 'nieder', 'zurich'))




Example 1: (Based on [17]) There are two tables, employee(ename, eno, street,
city) and salary(eno, amount), and two events, hire and raise, defined by:

    initiates_at(raise(eno, amt), salary(eno, amt), v, t).
    terminates_at(raise(eno, amt), salary(eno, amt'), v, t) ← amt ≠ amt'.
    initiates_at(hire(name, eno, st, cty), employee(name, eno, st, cty), v, t).

Also, it is assumed that the transactions⁴ in Table 1 have been entered in the
database. Next, we show some queries on this database. These are expressed as
Prolog rules.

1. List the salary history of the employees.

    q1(E,S,I) :- mholds_for(employee(E,N,_,_), I1, 10.1),
                 mholds_for(salary(N,S), I2, 10.1), intersection([I1,I2], I).

In this query the employee and her salary are extracted, along with their
respective valid time intervals, using the mholds_for rule. Then the
intersection of the valid time intervals is computed. This is equivalent to
performing a temporal join. The results of this query are:

    franziska  3200  [2.01, 6.01]
    franziska  3800  [6.01, 8.01]
    franziska  3800  [8.01, inf]
    therese    3630  [2.05, inf]
    lilian     3500  [2.02, inf]
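
The temporal join in query 1 can be illustrated with a short Python sketch. The interval lists below are an assumption: they were written down by hand from the query results above rather than derived from the events of Table 1, dates are encoded as (month, day) tuples, and `INF` is a hypothetical sentinel for the open upper bound.

```python
INF = (float("inf"), 0)  # sentinel later than any (month, day) date

# Assumed current state at transaction time 10.1: (name, eno, valid interval).
employee = [
    ("therese", 5873, ((2, 1), INF)),
    ("franziska", 6542, ((2, 1), (8, 1))),  # rennweg address
    ("franziska", 6542, ((8, 1), INF)),     # nieder address
    ("lilian", 3463, ((2, 2), INF)),
]
# Assumed current state: (eno, amount, valid interval).
salary = [
    (5873, 3630, ((2, 5), INF)),
    (6542, 3200, ((2, 1), (6, 1))),
    (6542, 3800, ((6, 1), INF)),
    (3463, 3500, ((2, 2), INF)),
]

def intersection(i1, i2):
    # Common part of two intervals, or None when they do not overlap.
    start, end = max(i1[0], i2[0]), min(i1[1], i2[1])
    return (start, end) if start < end else None

def q1():
    # Join employee and salary on eno, keeping the overlap of their intervals.
    rows = []
    for name, eno, i1 in employee:
        for sal_eno, amount, i2 in salary:
            common = intersection(i1, i2) if eno == sal_eno else None
            if common is not None:
                rows.append((name, amount, common))
    return rows
```

Calling `q1()` on this state produces five rows, one for each overlapping pair of employee and salary intervals.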

2. List the employees for which no one makes a higher salary in a different city.
This query can be expressed as:

    q2(E,I) :- mholds_for(employee(E,N,_,C), I1, 10.1),
               mholds_for(salary(N,S), I2, 10.1), intersection([I2,I1], In),
               (auxq2([N,S,C], I3) -> diff(In, I3, I) ; I = In).

    auxq2([N,S,C], I) :- mholds_for(salary(N1,S1), Iv2, 10.1),
               N \== N1, mholds_for(employee(_,N1,_,C1), Iv1, 10.1),
               intersection([Iv1,Iv2], I), S1 > S, C1 \== C.

The results are:

    franziska  [2.01, 2.02]
    franziska  [6.01, 8.01]
    franziska  [8.01, inf]
    lilian     [2.02, 2.05]
    therese    [2.05, inf]

⁴ The times shown (e.g. 02.01) represent dates in (month.day) format.



3 A Caching Mechanism for the BtEC 



The event calculus performs a lot of redundant computation when answering
queries, since every time a query about a property is posed, the whole
computation is repeated from the beginning, even though the property might
not have changed [2]. One way to improve its efficiency is to add a caching
mechanism, as in the cached event calculus (CEC) [2], to store the valid time
intervals of the properties and update them efficiently whenever an event is
added to the database (the undesirable alternative would be to compute
everything from scratch after every transaction). Thus, when a query is
formulated, the intervals are already available in memory and they do not have
to be computed by the rules.

We extended and adapted the CEC to the case of interval-based updates 
and two time lines, i.e. the formalism of Sect. 2 [10]. The idea is to keep in 
memory the set of valid time intervals corresponding to the current (relative to 
transaction time) historical state, and update it with every transaction. That is, 
the cache keeps an historical database at the current transaction time. The cache 
is computed by means of rules in the BtEC specification. It consists of a collection
of facts of the form mholds_for(Prop, ValidInterval), without transaction time.

Since most of the queries posed to a temporal database refer to the current 
transaction time, these can be answered just by looking up in memory the rele- 
vant intervals. The rules given in Sect. 2 are executed only when a query refers 
to some past state. There is a price to be paid, of course: the time required to
perform the update does increase, since the cache needs to be modified to reflect
the effect of the event that was introduced, but the performance gained in an- 
swering queries makes the combined cost lower than in the non-cached version 
[2]. 
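
As a rough illustration of the idea, the Python sketch below keeps, for each property, the valid time intervals of the current historical state and reports what each operation changed. It is only an assumed design: the class `Cache` and its methods `introduce`, `remove` and `mholds_for` are hypothetical names, intervals are not coalesced, and the actual cache update rules of the BtEC are the ones given in [10].

```python
INF = float("inf")

class Cache:
    """Valid time intervals of each property at the current transaction time."""

    def __init__(self):
        self.intervals = {}  # property -> list of (start, end)

    def introduce(self, prop, interval):
        # Record a new interval and report it as an introduced fact.
        self.intervals.setdefault(prop, []).append(interval)
        return [(prop, interval)]

    def remove(self, prop, interval):
        # Cut `interval` out of the stored intervals; report the removed parts.
        vs, ve = interval
        kept, removed = [], []
        for a, b in self.intervals.get(prop, []):
            lo, hi = max(a, vs), min(b, ve)
            if lo >= hi:          # no overlap: interval is untouched
                kept.append((a, b))
                continue
            removed.append((prop, (lo, hi)))
            if a < lo:            # part before the removed span survives
                kept.append((a, lo))
            if hi < b:            # part after the removed span survives
                kept.append((hi, b))
        self.intervals[prop] = kept
        return removed

    def mholds_for(self, prop):
        # Queries about the current state are plain lookups.
        return sorted(self.intervals.get(prop, []))
```

For instance, a raise entered with valid interval [6, inf] could be applied as a `remove` of the old salary over that interval followed by an `introduce` of the new one. The lists returned by both calls are the kind of information the integrity checking of Sect. 4 relies on.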

Integrity constraint checking is also improved by using the cache, as will be 
seen in Sect. 4. 
4 Integrity Constraint Checking

We consider only temporal integrity constraints which refer to the current his- 
torical state, that is, to properties with arbitrary valid time intervals as time 
stamps, but only with a fixed transaction time (the current one). 

A way to check the consistency of a database is simply to transform the
integrity constraints into their denial forms, as queries about their violating
tuples, and pose them after each transaction. This is extremely expensive in the
simplified event calculus, where the valid time intervals must be derived each
time they are required. An important improvement can therefore be achieved by
using the caching mechanism described in Sect. 3. Since most integrity
constraints refer only to the current historical state (which is kept in the
cache), the verification process takes place entirely in the cache, which is
much more efficient⁵.

Further improvements on integrity constraint checking can be made by ex- 
ploiting the assumption that the database satisfies its integrity constraints prior 
to a transaction. We present here such a methodology. It can be seen as a special- 
ization of the one proposed in [7] for checking integrity constraints in a deductive 
database. In [7], an extension of the SLDNF resolution based proof procedure
with negation as failure [9] is used to reason forwards from the update, focusing
on the relevant parts of the database and constraints. It achieves the same effect
as the simplification algorithms of Nicolas [11].

The idea behind our methodology is to take advantage of the fact that, when
the caching mechanism updates the intervals, it gathers information regarding
which properties were modified as a direct or indirect⁶ result of the transaction.
After the cache has been updated, only those integrity constraints which
contain properties that were modified are considered for verification. Out of
these, only some, depending on the modified property and the integrity
constraint, are effectively considered for checking. The modified properties are
used to instantiate variables in the integrity constraints, which reduces the
search space. Also, in many cases the integrity constraints can be simplified,
so only “residues” of them are actually checked.

As explained before, we consider only temporal integrity constraints which 
refer to the current transaction time (the time of the cache) and we assume that 
they are written as denials. Thus, they have the form: 

    ← mholds_for(p1(x1), I1), ..., mholds_for(pn(xn), In),
      f(x1, I1, ..., xn, In), not aux(x1, I1, ..., xn, In).

    aux(x1, I1, ..., xn, In) ← mholds_for(q1(y1), I'1), ..., mholds_for(qm(ym), I'm),
      φ(x1, I1, ..., xn, In, y1, I'1, ..., ym, I'm).

⁵ Integrity constraints which refer to different transaction times in addition to valid
time do not benefit from this approach.

⁶ In case a precondition for a property being initiated is terminated by a new event,
the property ceases to hold.
That is, some properties (pi) appear in non-negated mholds_for clauses and
some others (qi) within a negation (aux rule).⁷ Notice also that we omitted the
transaction time from the mholds_for predicate here because that is the time of
the cache by default.

Example 2: Temporal Referential Integrity Constraint. 

We can express this integrity constraint (for a concrete case) as: 

    ← mholds_for(salary(e, s), I), not aux(e, I).                 (TRI)

    aux(e, I) ← mholds_for(employee(e), I1), contains(I1, I).

which says that the integrity constraint is violated if there is a tuple (e, s) with
a valid time interval I in the salary table, such that e does not exist in the
employee table with an interval that contains I. In this case it is necessary to
use the auxiliary rule for aux to avoid floundering; otherwise we would have had
the free variable, I1, inside a negation. □

Before describing our procedure for integrity constraint checking, let us
notice that no matter how complex the events are (either domain dependent or
updates), in the end, with respect to the current transaction time, everything
reduces to the introduction into the cache of certain properties valid at
certain intervals, or the removal of them. The procedure, summarized in Fig. 1,
is based on this observation.

Notice that an integrity constraint is checked only when a property which
appears non-negated (respectively, within a negation) in it is introduced
(respectively, removed)⁸. Under the assumption that the database satisfies its integrity
constraints prior to the transaction, that is, they are true in the cache we had 
before the execution of the new event or update, the introduction (removal) of a 
property which occurs negated (non-negated) in an integrity constraint cannot 
falsify it. 

Nicolas, in [12], has shown that integrity constraints in prenex conjunctive
normal form⁹ which do not contain a relation R in a negated atomic formula
(respectively, non-negated) are unaffected when a tuple is introduced (respectively
removed) into (from) R. The cases in which Nicolas says it is not necessary to
check the integrity constraint are exactly the cases we are not considering¹⁰.

The way the integrity constraints are checked (unifying the occurrence of 
the modified property in the integrity constraint and, possibly, deleting that 
occurrence) resembles the refutation that takes place in [7]. However, since our

⁷ The reason for introducing the negations in this form is to avoid floundering, that
is, a call to negation as failure with uninstantiated variables [9].

⁸ We use the words “introduced” and “removed” for the operations on the cache,
instead of “inserted” and “deleted” that are reserved for operations on the database.

⁹ That is, formulas that consist of a prefix of quantifiers followed by a quantifier free
formula that is a conjunction of disjunctions.

¹⁰ Remember that we are dealing with integrity constraints in denial form. Integrity
constraints that are not written as denials can be transformed into this form by
using the well-known Lloyd-Topor transformation rules [9].







C.A. Mareco, L. Bertossi 



For each modified property r(a):

  For each integrity constraint IC:

    If (r(a) was introduced in the interval [c,d] and r appears non-negated in IC):

      - Unify the clause mholds_for(r(x), [y,z]) in IC with
        mholds_for(r(a), [c,d]). Call the resulting (instantiated) rule IC_a.

      - Delete mholds_for(r(a), [c,d]) from IC_a. Call it IC'_a.

      - Check IC'_a against the updated cache.

    If (r(a) was removed and r appears within a negation in IC):

      - Unify the clause mholds_for(r(x), [y,z]) in IC with
        mholds_for(r(a), [c,d]). Call the resulting (instantiated) rule IC_a.

      - Check IC_a against the updated cache.



Fig. 1. Integrity constraint checking procedure 



database consists only of mholds_for facts (there are no intensional rules), we do
not need to consider the cases in which deductive rules are inserted or deleted 
nor the implicit changes in the intensional knowledge which could occur as a 
result of a database update. Thus we are left with a simplified methodology well 
suited to checking a large class of integrity constraints, those that only refer to 
the current historical state, in the bitemporal event calculus. 

Example 3: Consider the integrity constraint (TRI) of the previous example. 
Suppose the database is empty and the fact that John is hired in the interval 
[5,20] is then recorded. Since the fact employee(john) was inserted and employee
appears within a negation, the integrity constraint cannot be violated by this 
transaction. 

Now we issue a transaction that states that John’s salary will be 40k during 
the interval [10,20]. In this case, since the inserted property appears non-negated 
in the integrity constraint, it must be checked. After performing the unification 
and deleting the (unified) occurrence of salary in the integrity constraint, we 
must try to show that: not aux(john, [10,20]), which is not possible. This means 
the integrity constraint was not violated by the transaction. 

Alternatively, if we had entered that John’s salary would be 40k during the 
interval [10,30], then: not aux(john, [10,30]) would have been true, showing that
the integrity constraint was indeed falsified. □ 
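
The reasoning of Example 3 can be replayed with a small Python sketch of the residue check for (TRI). This is an illustration under assumptions, not the authors' implementation: the cache is modeled as a set of hypothetical (relation, arguments, interval) triples, only introductions are handled, and `tri_violated` hard-codes the residue that the procedure of Fig. 1 would derive for this one constraint.

```python
def contains(outer, inner):
    # True when interval `outer` covers interval `inner`.
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def aux(cache, e, interval):
    # aux(e, I): employee e exists with an interval that contains I.
    return any(
        rel == "employee" and args[0] == e and contains(i1, interval)
        for rel, args, i1 in cache
    )

def tri_violated(cache, introduced):
    # Residue of (TRI) after unifying with an introduced fact.
    rel, args, interval = introduced
    if rel != "salary":
        # employee occurs only within the negation, so introducing it
        # cannot falsify the constraint and nothing is checked.
        return False
    return not aux(cache, args[0], interval)

def apply_introduction(cache, fact):
    # Update the cache first, then check the residue against it.
    cache.add(fact)
    return tri_violated(cache, fact)
```

Starting from an empty cache, introducing employee john over [5,20] triggers no check; a salary over [10,20] passes the check, while a salary over [10,30] is reported as a violation.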

Every time an event or update is executed on the database, the integrity
constraints are checked against the cache on which corresponding introductions
and removals are performed (using the methodology already described)¹¹. Now,
to update the cache, a replace update on the database is transformed into a del
followed by an ins on the database, and they have corresponding effects on the
cache. Our specification of the continuous checking of the integrity constraints
does not allow the integrity constraints to be checked after the first (del)
part of the replace, but only after the second (ins) part has been completed.

¹¹ This continuous checking is necessary to ensure that the key assumption, namely
that the integrity constraints are satisfied before every event or update
execution, holds.


The checking procedure introduced above can be easily implemented right 
into the caching mechanism. We used PROLOG for this. The mechanism must col- 
lect all the properties that were inserted (along with their intervals) and deleted 
as a result of a transaction into two lists: added and deleted. The integrity con- 
straints are specified as rules whose heads contain two lists: the list of properties 
that appear non-negated (and their intervals) and the list of properties that ap- 
pear negated (in this case, no intervals are needed) in the integrity constraint, 
as follows (n is simply the number of the integrity constraint): 

          non-negated properties    negated properties
    ic(n, [(p1(x1)-I1), ...],       [q1(y1), ..., qm(ym)]) ←
        mholds_for(p1(x1), I1), ..., f(p1(x1), I1, ...), not aux(p1(x1), I1, ...).

To perform the checking, each integrity constraint is considered separately. If the 
list of non-negated properties has an element in common with the added list, the 
occurrence of the element in the integrity constraint is unified with the modified 
property, then deleted from the integrity constraint and the remaining integrity 
constraint is checked. If the list of negated properties has an element in common 
with the deleted list, the occurrence of the element is unified with the property 
and the instantiated integrity constraint is checked. 

Example 4: The salary of an employee cannot decrease. 

This integrity constraint can be formulated as: 

    ic(1, [(salary(E,S)-I), (salary(E,S1)-I1)], []) :-
        mholds_for(salary(E,S), I), mholds_for(salary(E,S1), I1),
        S < S1, lessEq(I1, I).

This means that there cannot be an employee with salaries S (in the interval
I) and S1 (in the interval I1) such that the lowest salary (S) occurs in an
interval I > I1.

The head of the rule contains the integrity constraint number, 1, the list
of properties that appear non-negated in the integrity constraint (in this case,
both occurrences of salary) and the list of negated properties (which is empty in
this case). Suppose that the cache contains mholds_for(salary(john, 40000),
[5,20]). If a transaction stating that John’s salary will be 30000 in [20,30]
(which violates the integrity constraint) is entered in the database, then the
added list will contain [salary(john,30000)-[20,30]]. Since the list of
non-negated properties does have elements in common with the added list, we
unify the first element of the non-negated list with the sole element of added
(this instantiates the variables E, S, I). Then we delete the first occurrence of
salary in the right hand side of the rule and try to prove

    mholds_for(salary(john,S1), I1), 30000 < S1, lessEq(I1, [20,30]).

which is true, since the cache contains mholds_for(salary(john,40000), [5,20]).
This indicates that the transaction violated the integrity constraint.
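
The residue check of Example 4 can also be sketched in Python. The paper does not define lessEq, so the definition below (the first interval ends no later than the second begins) is an assumption chosen to agree with the example; the cache layout and the name `violates_ic1` are hypothetical as well, and only the residue obtained by unifying the first salary occurrence is covered.

```python
def less_eq(i1, i2):
    # Assumed meaning of lessEq(I1, I): I1 ends no later than I begins.
    return i1[1] <= i2[0]

def violates_ic1(cache, added):
    # Residue of constraint 1 after unifying its first salary occurrence with
    # the added fact: look for an earlier, strictly higher salary of the same
    # employee.
    emp, amount, interval = added
    return any(
        emp1 == emp and amount < amount1 and less_eq(interval1, interval)
        for emp1, amount1, interval1 in cache
    )

# Cache content of the example: salary(john, 40000) holds in [5,20].
cache = [("john", 40000, (5, 20))]
```

With this cache, `violates_ic1(cache, ("john", 30000, (20, 30)))` is true, matching the example, whereas adding a salary of 50000 over the same interval passes the check.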
5 Further and Related Work

Our BtEC based specification of a temporal database supports the essential fea- 
tures of available and well-known temporal database specifications, like TSQL2 
[15]. Some important differences with TSQL2 have to do with the fact that the 
BtEC has a full logical specification. In particular, automated reasoning is pos- 
sible from the specification. BtEC also supports a much wider class of updates, 
the ones that are domain specific and correspond to the primitive events of the 
event calculus. Also more general integrity constraints are supported by BtEC. 
On the other hand, implementation and optimization issues, already addressed in
TSQL2, are still an open subject in BtEC¹².

S. M. Sripada, in [18], extends the event calculus to support transaction 
time. His framework is based on the original event calculus [8], and is capable of 
handling both proactive and retroactive updates, as well as correction of errors 
(by means of a revise event, which eliminates all the effects of a given event). 
In [19], an efficient implementation of the event calculus for temporal database 
applications is presented. This approach, called Digest, stores all the information 
about the time periods that can be deduced from the database, and, whenever a
new event is added, the Digest is updated incrementally. The problem of integrity 
constraint checking is not addressed. 

Our formalism, being based on time points instead of named intervals, is 
much simpler than Sripada’s. BtEC (without cache) consists of only 9 rules, as 
opposed to Sripada’s 24 rules. Updates in BtEC can be done by specifying time 
intervals rather than pairs of (possibly artificial) events, which is easier to specify 
and more natural and similar to traditional database updates. Also, all types of 
update that appear in temporal databases are available in BtEC. Corrections 
in our model are made with a replace event, which allows the replacement of 
a single property as opposed to revising an event (which alters all properties 
affected by it). 

The caching mechanism of Sect. 3 serves a similar purpose to the Digest 
approach, improving the query answering response time. However, we also take 
advantage of the caching mechanism to enhance integrity constraint verification. 

With respect to integrity constraints involving both valid and transaction 
time, in particular, temporal specialization dependencies [4], it is possible, as 
shown in [10], to optimize the BtEC specification on the basis of a compilation 
of the integrity constraints into the BtEC rules. 

In a temporal database, multiple values for a single data item may appear, 
when different values are assigned to it at different transaction times and in the 
same valid time point. Thus, a problem of simultaneous value semantics, SVS, 
appears [3]. As shown in [10], it is possible to specify and implement SVS in an 
extension of the bitemporal event calculus with relatively few changes, including 
decision time [3], that is used to determine the ordering of events. 

¹² Nevertheless, a prototype implemented in Prolog that incorporates all the
features mentioned in this paper (and others) has been developed [10]. With
respect to integrity constraint checking, a rollback predicate has been
implemented.
Abstract. Extremely large and ever growing databases lead to severe
problems in storage, database administration, and performance. The
integration of archives into database systems represents a new approach to
relieve a database of rarely used data. Archive data is separated from the
database logically and physically and can be placed on tertiary storage.
Time-related aspects play an important role in the development of
functionality and a suitable infrastructure for archiving. One of these
aspects stems from schema changes in the database or archive, which should
not affect already archived data. We show how the impact of these changes
can be handled by archive schema versioning.



1 Introduction 

Databases of an order of hundreds of gigabytes can be found in many integrated
business application environments. In scientific and other special domains even 
terabytes of data are not uncommon anymore. However, very large and growing 
databases also cause problems. The cost of secondary storage is not negligible, 
database administration becomes more difficult and, above all, performance may 
suffer disproportionately. 

Archiving in database systems represents a new approach to tackle these 
problems. It is based on the observation that not all data is used permanently and 
with the same intensity. We will use the term operative data for data frequently 
processed by application systems. On the other hand, non-operative data is rarely 
used but has to be available in the long term. Outdated business or engineering 
data, for example, cannot be deleted because of legal or economic reasons [4]. 
Raw data collected in a scientific environment may be non-operative for some 
time when processing is deferred. These logical differences are used to separate 
operative and non-operative data physically as well. Non-operative data can be 
moved from the database into an archive (Figure 1). Note that rarely used data 
may become operative again. Therefore, it is possible to restore archived data 
to the database. An archive can also be queried directly. 

Aspects of time play an important role in our archiving concept. Data
versioning is used to express historical connections between archived data and to
ensure their authenticity. A special form of schema versioning is used for archives
to handle changes of an associated database schema. Some results of research in
temporal databases can be applied and adapted for these purposes [12].
Fundamentals of schema versioning were studied in [9, 1].

Fig. 1. Adding an archive to a database

Our contributions towards a fully-fledged archiving service focus on a seam- 
less integration of both archives and archiving functionality into database sys- 
tems. The concepts developed for a functional integration of archiving into rela- 
tional database systems were formalized by a compatible extension to SQL/92 
we named Archive SQL (ASQL). This language is currently being implemented 
as part of the Archive Management System prototype (AMS). Section 2 of this 
paper introduces our approach to archiving. The impact of changes to a database 
schema on associated archives is the objective of Section 3 introducing archive 
schema versioning. Section 4 concludes and discusses future work. 



2 The Archiving Approach 

2.1 Architecture 

The only form of archiving found in application environments today provides 
archiving functionality by application programs or dedicated modules running 
on top of a database management system (DBMS). Our approach of database 
system integrated archiving aims to provide adequate functionality as a new ser- 
vice of the DBMS [10]. Here, archives are managed by the DBMS in addition 
to databases. Archived data is separated from the database logically and physi- 
cally while still remaining under control of the DBMS. Processing of very large 
amounts of data may benefit substantially from built-in archiving support in the 
DBMS. Data administration should be much easier, because archives are man- 
aged by the DBMS. In contrast, administrating archives outside the database 
system would require additional tools for bookkeeping, integrity enforcement, 
and so on. System integrated archiving is also expected to have better per- 
formance than system based archiving, because more data processing may be 
left to the DBMS and no data has to leave the database system. Additionally, 
the DBMS can provide operations and integrity constraints referring to both 
database and archive. Thus certain relationships between data in database and 
archive can be exploited and preserved.

Since integrated archives are exclusively controlled by the DBMS, common
mechanisms for database security and safety are extended to archives. Database
and archive are based on the same data model, e.g., the relational model. Archive
objects can be augmented by time-related information contributing to the se- 
mantics of archived data. The access to all archived data must be maintained on 
a long-term basis and must not be affected by changes to the database schema. 
An archive should be able to survive an associated database when the database is 
dropped. Furthermore, an integrated access spanning both database and archive 
should be supported. Some design issues affecting the semantics of archived data 
can be derived from the overall aims of archiving: 

— Database relevance 

An archive supplements the database and does not replace it. The database 
is still the primary place to store and process data. Therefore, archive objects 
are defined with respect to objects in a corresponding database. This suggests
identifying archive objects in terms of the database. The logical and physical
mapping from database objects to archive objects is hidden from the user.

— Authenticity of archived data 

The semantics of archived data shall be preserved. Therefore, archived data 
must not be updated. It follows that archive structures cannot be destroyed
as long as associated data has to be kept. 

— Database autonomy 

Data frequently required by database applications must not be moved to an 
archive. If such data is needed in an archive as well in order to fulfil some 
design criterion, it must be copied. Operations on a database should not be 
affected or restricted by the existence of archives or archived data. 

2.2 Basics of Archive SQL 

ASQL is a language for relational database systems providing an integrated 
archiving service [7]. The language respects the design issues formulated above 
and is an upward compatible extension to SQL/92^ [5,6]. 

An archive schema consists of a collection of tables, integrity constraints, 
rules, and privileges (Section 2.3). Archive tables are defined as transaction time 
tables [11] where transaction time is derived from the time when data is inserted.
Tables and constraints are based on objects in a corresponding database, thus
implementing the criterion of database relevance. Changes to the definition of a 
database object, an associated archive object, or the connection between them 
are handled by versioning of archive tables as introduced in Section 3. 

Figure 1 already illustrated three new data manipulation operations of ASQL. 
Data is exchanged between database and archive by archive and restore; select
is used to query archived data. Queries spanning database and archive are
supported as well. Additional operations concern the direct insertion of new, 
non-operative data bypassing the database (insert) and the eventual deletion of 
archived data (delete). As mentioned earlier, archived data cannot be updated. 
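To make the division of labour between these operations concrete, the following Python sketch models a single database table and its attached archive table in memory. It is only an illustration under stated assumptions: it is not ASQL syntax and says nothing about how the AMS prototype works. The class and method names (ToyArchive, archive_row, restore_row, insert_archived, delete_archived, select) are hypothetical, and the sketch assumes that archive and restore move rows rather than copy them.

```python
from itertools import count


class ToyArchive:
    """Hypothetical in-memory model: one database table plus its archive table."""

    def __init__(self):
        self.database = {}      # key -> row (operative data)
        self.archive = {}       # key -> (row, transaction time of insertion)
        self._clock = count(1)  # stand-in for a system clock

    def _store(self, key, row):
        # Archived data is never overwritten: a second insert is rejected.
        if key in self.archive:
            raise KeyError(f"{key!r} is already archived")
        self.archive[key] = (dict(row), next(self._clock))

    def archive_row(self, key):
        """archive: move a row from the database into the archive (assumed move)."""
        row = self.database[key]
        self._store(key, row)
        del self.database[key]

    def restore_row(self, key):
        """restore: move an archived row back into the database (assumed move)."""
        row, _ = self.archive.pop(key)
        self.database[key] = row

    def insert_archived(self, key, row):
        """insert: add non-operative data directly, bypassing the database."""
        self._store(key, row)

    def delete_archived(self, key):
        """delete: eventually remove archived data."""
        del self.archive[key]

    def select(self, predicate, include_database=False):
        """select: query the archive, optionally spanning the database too."""
        rows = [row for row, _ in self.archive.values() if predicate(row)]
        if include_database:
            rows += [row for row in self.database.values() if predicate(row)]
        return rows

    # There is deliberately no update method for archived rows.
```

The absence of an update method mirrors the authenticity requirement; everything else about the storage layout is a simplification chosen for brevity.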

^ In the following, SQL/92 is denoted simply by SQL.

2.3 Archive schema definition

Integrating archives into database systems requires the development of a suitable 
infrastructure. As outlined in this section, ASQL adds archives, archive tables, 
archive constraints, rules, and new privilege types to SQL. For further reading 
and examples of ASQL syntax see [7]. 

Archives An archive schema consists of a number of tables, constraints, rules, 
and privileges. All these objects belong to the user owning the archive. An archive 
owner is responsible for the appropriate interaction of all defined objects so that 
archived data fulfil the semantic requirements stated by some design objective. 



Archive tables and constraints Archives are an addition to a database;
archived data usually originates from this database. Therefore, the data
structures and constraints of an archive schema should be derived from the database
schema. ASQL supports the definition of archive tables and of unique and refe- 
rential constraints for these tables. All archived data is stored in these tables 
and subject to the specified constraints. 

The important aspect of uniform access to database and archives suggests 
that archive tables can only be based on named database tables while unnamed
results of arbitrary queries are not considered. In order to allow the exclusion of 
insignificant attributes, an archived base table can be projected to a subset of 
the attributes of the corresponding database table. 

Archive tables shall not be subject to constraints not found in the database. 
ASQL so far allows the adoption of unique and referential constraints of base tables for
corresponding archive tables thus preserving relations between data of a single 
as well as of different tables. This could be extended to other kinds of constraints 
later; unique constraints spanning database and archive tables might prove use- 
ful, for example. Several options can be included in the table definition regulating 
the direct insertion of data, the grouping of attributes to logical attributes, and 
the access to archived data. The latter two options are addressed in the context 
of archive table versioning (Section 3.2). 

Rule-based archiving ASQL allows the definition of rules for an implicit 
manipulation of data. Rule-based archiving is an indispensable concept for the 
realization of certain design issues: 

1. Archiving deleted or updated database data 

2. Time-based moving or copying of data from a database into archives 

3. Time-based deletion of archived data 
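Items 2 and 3 of this list are time-based and can be pictured as declarative rule records that are evaluated against a clock. The Python sketch below is a rough illustration only: the names AgeRule and apply_rules and the row field "created" are hypothetical, the ASQL rule syntax itself is given in [7] and is not reproduced here, and item 1 (archiving deleted or updated data) is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgeRule:
    """Hypothetical declarative rule acting on rows older than max_age."""
    action: str   # "move" or "copy" (database -> archive), "purge" (archive)
    max_age: int  # rows older than this many time units are affected


def apply_rules(database, archive, rules, now):
    """Apply time-based rules; rows are dicts with a 'created' timestamp.

    Simplification: age is always measured from 'created', even for purging.
    """
    for rule in rules:
        if rule.action in ("move", "copy"):
            old = [k for k, r in database.items()
                   if now - r["created"] > rule.max_age]
            for key in old:
                archive[key] = dict(database[key])
                if rule.action == "move":
                    del database[key]
        elif rule.action == "purge":
            old = [k for k, r in archive.items()
                   if now - r["created"] > rule.max_age]
            for key in old:
                del archive[key]
        else:
            raise ValueError(f"unknown action: {rule.action}")
```

The rules only describe what should happen to data of a certain age; when and how often they are evaluated is left open in this sketch.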

Barring referential actions, SQL currently provides no further rule concept. It 
was not our intention to develop a fully-fledged rule system similar to ECA rules
in active database systems or triggers available in many products and planned for
the forthcoming SQL standard [2,3]. ASQL rules are descriptive and specialized
for archiving. However, future extensions seem possible.



3 Archive Schema Versioning

In this section we want to evaluate the impact of changes to a database schema on 
associated archives thus leading to a further refinement of the concepts described. 

3.1 Schema versioning 

A database schema should be adjustable to new requirements following its initial 
definition. Three levels of support are usually considered for a system allowing 
changes to the definition of a populated database: 

Schema Modification Neither the old schema nor existing data are necessarily 
retained. 

Schema Evolution Retaining previous schema definitions is not required, ex- 
isting data is converted to the new schema when necessary. 

Schema Versioning Schema definitions and existing data are maintained. All 
data can be accessed through user definable version interfaces. 
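The difference between the last two levels can be illustrated with a small Python sketch. It is a simplified picture under stated assumptions, not a description of any particular system: the function names evolve and add_version and the dictionary layout are hypothetical, and schema modification (the first level) is not shown.

```python
def evolve(rows, new_attributes, default=None):
    """Schema evolution: existing rows are converted to the new schema.

    The previous schema definition is not kept; attributes that are no
    longer part of the schema are dropped from every row.
    """
    return [{a: row.get(a, default) for a in new_attributes} for row in rows]


def add_version(history, new_attributes, since):
    """Schema versioning: a new version is appended to the history.

    Earlier definitions and the rows stored under them stay untouched.
    """
    new_version = {"attributes": list(new_attributes), "since": since, "rows": []}
    return history + [new_version]
```

Under evolution the value of a dropped attribute is lost after conversion, whereas under versioning it remains reachable through the older version.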

While these definitions originate mainly from research on temporal and ob- 
ject-oriented databases [8], it should be noted that schema versioning can be 
used irrespective of data versioning. Several variants of schema versioning could 
be considered depending on the features provided by a system: 

— Data modification 

While data can be retrieved from all versions, insertion, update, and deletion 
may or may not be restricted to the current schema. 

— Creating new versions 

Not every schema change must result in a new schema version. Versioning 
could depend on the type of the changing operation, the objects affected, or 
user decision. If no new version is created, existing data is possibly converted. 

— Granularity of versioning 

Versioning so far implicitly referred to the entire database schema. In view 
of possible restrictions to some types of objects, such as data structures or 
integrity constraints, versioning could be limited to these objects. The cur- 
rent database schema would then consist of the current versions of versioned 
objects and objects not versioned at all. 

— Accessing versions 

Versions could be used as created. Another approach would be to limit access 
to version interfaces explicitly defined by a database administrator. Further- 
more, combining different versions could be useful. 

SQL as well as most database systems available today comply with schema 
evolution. Schema versioning came up within the context of temporal data mod- 
els. A non-temporal database is a snapshot database containing only current 
data and referring to a current schema. Therefore, the schema can be reorganized 
through destructive updating. Historical data in temporal databases, however, 
must not be updated. This extends to the schema level: schema changes modify 
the current state of the database and previous schema versions containing his- 
torical data are retained. Schema changes are associated with transaction time, 
because a schema defines how reality is modeled by the database.

3.2 Archive table versioning

In order to assess the impact of changes to a database schema on associated 
archives, let us recall some principles guiding the design of ASQL: 

— Relevance of archives to the database 

Since archives are an addition to a database, the tables and constraints of 
an archive schema are derived from the corresponding database schema. 

— Authenticity of archived data 

Archived data and relationships between them are preserved without changes 
and must not be updated. 

— Independence of databases and compatibility to SQL 

The introduction of archives shall not change the semantics of existing 
databases. SQL operations modifying the database schema are valid in ASQL 
as well, their execution shall not be restricted by existing archives. 

Combining these aspects suggests that the snapshot semantics of SQL data- 
base schemas cannot be expanded to suit archives as well. While changes to 
the schema of a database must be propagated to associated archives, archived 
data must not be adjusted to the new schema. An analogous argument applies to 
explicit changes to an archive schema itself. ASQL therefore provides a versioning 
concept for archive schemas. 

Apart from the argument of authenticity there are technical reasons as well to 
avoid the adaptation of archived data to a new schema. One of the reasons is that
archives may become very large and can be situated on relatively slow tertiary
storage. Adjusting a large amount of data can seriously affect throughput and
availability of the system. 

Versioning archive tables Archive schemas are not versioned as a whole; 
versioning is limited to archive tables and associated constraints. This is suffi- 
cient to preserve data without loss of semantics. The creation of new versions is 
automatically initiated by certain data definition statements discussed later on. 
An archive table attached to a database table has a current version. Insertion of 
data is restricted to this version thus following the snapshot semantics of SQL. 
For retrieval and deletion, all table versions can be used. 

The central objects of an archive schema are tables. Archive tables have 
version-independent characteristics such as name, associated rules and privileges, 
and certain options regulating the direct insertion of data, the grouping of at- 
tributes to logical attributes, and the access to archived data. Version-dependent 
characteristics include attributes, constraints, and the connection to a database 
table. In ASQL, all attributes are assigned directly to an archive table; versions 
of this table refer to appropriate subsets. Attributes can thus belong to sev- 
eral versions which is important for a useful cross-version retrieval. Unique and 
referential constraints, on the other hand, refer to the table versions themselves. 

Rules and privileges are not versioned at all. They may temporarily become 
inapplicable, however, when archive tables they refer to are detached from their 
database tables. Since detached archive tables do not have a current version, no 
data can be inserted. 
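The behaviour described in this subsection (one current version, insertion restricted to it, retrieval across all versions, no insertion while detached) can be sketched as follows. The Python class below is a hypothetical model for illustration; its name, methods, and data layout are assumptions, and constraints, rules, and privileges are left out.

```python
class VersionedArchiveTable:
    """Hypothetical model of an archive table with versions."""

    def __init__(self, name):
        self.name = name    # version-independent characteristic
        self.versions = []  # each: {"attrs", "start", "end", "rows"}

    @property
    def current(self):
        """The current version, or None while the table is detached."""
        if self.versions and self.versions[-1]["end"] is None:
            return self.versions[-1]
        return None

    def detach(self, time):
        """Close the current version; afterwards nothing can be inserted."""
        version = self.current
        if version is not None:
            version["end"] = time

    def new_version(self, attrs, time):
        """Close the current version (if any) and open a new one."""
        self.detach(time)
        self.versions.append(
            {"attrs": frozenset(attrs), "start": time, "end": None, "rows": []})

    def insert(self, row):
        """Insertion is restricted to the current version."""
        version = self.current
        if version is None:
            raise RuntimeError("detached archive table has no current version")
        if set(row) != version["attrs"]:
            raise ValueError("row does not match the current version")
        version["rows"].append(dict(row))

    def all_rows(self):
        """Retrieval may use every version, not only the current one."""
        return [row for v in self.versions for row in v["rows"]]
```

Reattachment corresponds to calling new_version again after detach, which leaves a gap in the sequence of validity intervals.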

Versioning process Creating an archive table for some database table
T attaches these tables to each other (Figures 2(a) and 2(b)). An archive
table is versioned with respect to this connection. Since the criterion of database
relevance requires that the definitions of both tables must correspond, changes
to database schema or archive schema are offset by our versioning concept as
illustrated in Figures 2(c) and 2(d) respectively. In both cases, a new archive
table version reflects the modifications. This version comprises the subsets of
attributes and constraints from T currently activated in the archive table.





(c) Modified database table (d) Modified archive table 

Fig. 2. Modifications of database tables and archive tables 



Database driven changes concern the addition and deletion of base table 
attributes and constraints as well as the modification of an attribute’s data type. 
Archive driven changes comprise explicit modifications to the sets of attributes 
or constraints taken over from the corresponding base table. 

Besides the modifications observed so far, the connection between database
and archive table must be taken into account. An archive table may be
detached explicitly or by revoking a privilege from its owner (Figure 3(a)). A
detached archive table has no current version; data cannot be inserted. In ASQL,
reattachment results in a new current version of the archive table even if the
definition of T has not changed (Figure 3(b)). The attachment is lost as well, of
course, when T is dropped as illustrated in Figure 3(c). A new database table
having the same name may later be attached to the archive table (Figure 3(d)).

Note that creating or modifying an archive table may cause the versioning 
of other archive tables, if the corresponding database tables are connected by 
a referential constraint. In order to avoid inflationary effects, only the last of 
possibly several versions of a table generated within a transaction is actually 
maintained at the end of this transaction. The time of versioning is based on a 
timestamp which is unique within this transaction (e. g. time of commit). 
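The rule that only the last version generated within a transaction is maintained can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name versions_to_keep is hypothetical, the changes belong to a single archive table, and transactions are assumed not to interleave.

```python
def versions_to_keep(changes):
    """Keep only the last version produced by each transaction.

    changes: (transaction_id, definition) pairs in execution order for one
    archive table. Returns one pair per transaction, in order of first
    appearance, carrying the last definition that transaction produced.
    """
    kept = {}
    for transaction_id, definition in changes:
        kept[transaction_id] = definition  # a later change overwrites earlier ones
    return list(kept.items())
```

For instance, two changes in one transaction followed by three changes in another leave exactly two maintained versions.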

Fig. 3. Attachment changes: (a) Detached archive table, (b) Reattached archive table, (c) Dropped database table, (d) New database table attached



Homogeneous access and naming conflicts Archive tables and constraints
retain the names of the database objects they are derived from. The advantage of
this approach is a homogeneous access to related objects in database and archive.
Bringing together destructive database schema changes and the versioning of 
archive tables, however, raises some problems in this context, because the name 
of a destroyed database object can be reused. 

Consider, for example, an archive table attached to some database table
T (Figure 4(a)). Destroying table T severs this connection as well. The archive
table still exists, but without a current version (Figure 4(b)). The archive table
may be attached to a newly defined table T. While this is sensible if their meaning
is matched, table T can in general contain an entirely different kind of data
(Figure 4(c)). A new archive table created for T has no connection to the already
existing archive table (Figure 4(d)).

Both archive tables shown in Figure 4(d) can be referred to by the same 
combination of table name and archive name. In order to distinguish between 
them, ASQL allows specifying any time at which the archive table needed had a
current version. The default, of course, refers to the current time. Since at most 
one of the competing tables can be attached to a database table at a given time, 
this is sufficient for a unique identification. 
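The identification by time can be sketched as a lookup over attachment periods. The Python function below is a hypothetical illustration, not ASQL: the name resolve, the dictionary layout, and the use of half-open periods with None for a still-attached table are assumptions made for the example.

```python
def resolve(tables, name, at):
    """Pick the archive table called `name` that had a current version at `at`.

    tables: list of dicts with keys 'id', 'name', and 'attached', where
    'attached' is a list of half-open (start, end) periods and end=None
    means the table is still attached.
    """
    hits = [t["id"] for t in tables
            if t["name"] == name
            and any(start <= at and (end is None or at < end)
                    for start, end in t["attached"])]
    if len(hits) != 1:
        raise LookupError(
            f"{len(hits)} archive tables named {name!r} at time {at}")
    return hits[0]
```

Because at most one of the competing tables is attached at any given time, the lookup yields either exactly one table or none.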



Logical attributes Problems as described in the previous section can occur on 
the attribute level as well. An attribute added to a database table, for instance, 
could receive the name of an attribute dropped earlier. These attributes may 
or may not have the same meaning, their types could differ. In an associated 
archive table such attributes belong to different versions. A query concerning 
data from both versions, however, should treat them as one attribute as long as 
their meanings coincide. This observation gives rise to the notion of logical
attributes. Attributes referenced in different versions of an archive table can be
grouped to yield a logical attribute used in queries.

Fig. 4. Handling of naming conflicts: (a) Database table and archive table, (b) Unattached archive table, (c) Newly created database table, (d) New archive table definition

Figure 5 illustrates this approach and highlights the different situations where 
attributes may be grouped together. But first remember that attributes belong 
directly to archive tables. While names must be unique within versions, this does 
not apply to the table itself. Attributes are added to an archive table when they 
are activated. They cannot be deleted, just deactivated. Activating an attribute 
involves its assignment to a logical attribute. Consider, for example, the first 
version shown in Figure 5. This version results from creating the archive table at 
time 1; the three attributes activated are naturally assigned to new and different 
groups. A deactivation of two attributes at time 2 is followed by an activation 
of two attributes having the same names at time 3. Notice that these attributes 
may differ from the attributes activated earlier; they could have been dropped 
and recreated. In our example, A is assigned to an existing logical attribute while 
B forms its own group. The type of attribute B is changed at time 4 resulting in a 
new version. The assignment to the same logical attribute suggests a compatible 
change, for instance an enlargement of a character string type. Detaching the 
archive table at time 5, possibly by deleting the corresponding database table, 
leaves a gap in the version sequence until reattachment at time 15. The attributes 
activated are assigned to different existing or new groups. The version generated 
at time 16 has no effect on the existing attributes. It could have been caused, 
for instance, by constraint modifications. Following time 18, the archive table is
without a current version; all attributes are deactivated.

Thus every attribute activated for an archive table must be assigned to a
logical attribute according to its meaning. There are several variants to achieve
this. The solution provided in ASQL allows tuning how far the types of attributes
in a group must agree. Minimal requirements are the same name and comparable
types for all attributes of a group, e.g., VARCHAR and CHAR. Stronger options
include restrictions to the same basic type, for instance CHAR, or even the same
parameters, for instance CHAR(20). It is also possible to prohibit grouping
altogether.

Fig. 5. Attributes of archive tables (example)
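The graded agreement between attribute types can be sketched as a single predicate. The Python code below is an illustration only: the level names ("comparable", "same_basic", "same_params", "none"), the function name may_group, and the FAMILY table that decides which basic types count as comparable are all assumptions, not ASQL definitions.

```python
# Hypothetical mapping of basic types to families of comparable types.
FAMILY = {
    "CHAR": "string",
    "VARCHAR": "string",
    "INTEGER": "numeric",
    "DECIMAL": "numeric",
}


def may_group(a, b, level="comparable"):
    """Decide whether two attributes may share a logical attribute.

    a, b: (name, basic_type, parameter) triples, e.g. ("B", "CHAR", 20).
    level: how strictly the types must agree; "none" prohibits grouping.
    """
    name_a, type_a, param_a = a
    name_b, type_b, param_b = b
    if level == "none" or name_a != name_b:
        return False
    if level == "comparable":
        return FAMILY[type_a] == FAMILY[type_b]
    if level == "same_basic":
        return type_a == type_b
    if level == "same_params":
        return (type_a, param_a) == (type_b, param_b)
    raise ValueError(f"unknown level: {level}")
```

With these assumptions, a CHAR(20) and a VARCHAR(40) attribute of the same name may be grouped at the weakest level only, while widening CHAR(20) to CHAR(30) still passes the same-basic-type check.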



Accessing versioned tables In ASQL, archive tables can be referenced in 
SELECT-statements. Versioned tables can be used for queries in different ways:

— Actuality: Any data can be seen through the actual table structure. 

— Authenticity: Data from any version can be seen as archived. 

— History: Any data can be seen through the structure of any table version.

— Combinations: Data from several versions can be seen through a structure 
produced by combining the structures of these versions (union, intersection). 

The first variant allows data to be viewed according to the current modelling of
the database; data of earlier versions is adjusted if necessary. In the second case
archived data is retrieved true to the original. As the third variant suggests, any
data could be viewed in general with respect to the modelling at any point of
time. Finally, the combination of versions allows focusing on attributes common
to the considered versions or retaining all attributes of all these versions.

Figure 6 shows the three versions of an example table complete with some
data. The header of each version includes the interval of validity and the logical
attributes applicable for the version. It can be identified by any point of time
from [12, 15) ∪ [19, UC]; UC stands for until changed, describing current time
and current version. In ASQL, UC corresponds to CURRENT_TIMESTAMP.

The scope of an archive table referenced in the FROM-clause of a query is
determined in two steps. A time interval can be specified in a first step to restrict
both data and versions considered in a query. For example, a user may only be
interested in data archived in 1998. Conceptually, this step results in a table
containing all data from the specified period. Its structure consists of all attributes
of the determined versions, grouped to logical attributes as described earlier.
NULL is added to tuples not referring to all these attributes. Figure 7 illustrates
the selection of data and attributes from the versions of the archive table
restricted to the period [14, UC].

Fig. 6. Versions of an archive table



Fig. 7. Restricting data and versions of the archive table to [14, UC]



The final table structure used in the query is determined in a second step. 
ASQL allows choosing all attributes from the first step (Figure 7), projecting to
the logical attributes common to all versions used (Figure 8(a)), or selecting
the set of attributes of a single version. This version can be specified by a point 
of time within its valid period as illustrated in Figure 8(c). A special case often 
used is the current table version (Figure 8(b)). Note that combining disjoint
versions as in Figure 8(c) may not always be plausible. 
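The two steps can be sketched in Python as two small functions. This is an illustration under stated assumptions rather than the ASQL semantics: the names restrict and final_structure, the option names ("all", "common", "version"), and the dictionary layout are hypothetical, periods are treated as half-open, UC is modelled as infinity, and the sample data in the usage note is invented and is not the content of Figure 6.

```python
UC = float("inf")  # stands for "until changed"


def restrict(versions, period):
    """Step 1: keep versions and rows falling into the half-open period.

    versions: list of dicts with 'valid' (list of (start, end) intervals),
    'attrs' (logical attribute names) and 'rows' (list of (time, row) pairs).
    """
    lo, hi = period
    used = [v for v in versions
            if any(start < hi and lo < end for start, end in v["valid"])]
    attrs = []
    for v in used:
        attrs += [a for a in v["attrs"] if a not in attrs]
    # Attributes a row does not refer to are filled with None (NULL).
    rows = [{a: row.get(a) for a in attrs}
            for v in used
            for time, row in v["rows"] if lo <= time < hi]
    return used, attrs, rows


def final_structure(versions, used, attrs, rows, option="all", at=None):
    """Step 2: choose the table structure used in the query."""
    if option == "all":
        chosen = list(attrs)
    elif option == "common":
        chosen = [a for a in attrs if all(a in v["attrs"] for v in used)]
    elif option == "version":
        # Structure of the single version valid at time `at`; this version
        # need not be among those selected in step 1.
        version = next(v for v in versions
                       if any(start <= at < end for start, end in v["valid"]))
        chosen = list(version["attrs"])
    else:
        raise ValueError(f"unknown option: {option}")
    return [{a: row.get(a) for a in chosen} for row in rows]
```

As a usage example, take three invented versions with attributes (A, B), (A, C), and (A, D), the second valid in [12, 15) and again from 19 onwards. Restricting to [14, UC) selects the second and third version, and the "common" option then projects every row to A alone.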



Fig. 8. Final structure of the archive table: (a) common, (b) current, (c) T=12



Structure and data of a table referenced in a query can thus be selected in 
a flexible way. This is complemented by suitable defaults. If no time period is 
specified for the first step, no restrictions apply; all versions and data can be 
used in the query. If no option is chosen for the second step, a default defined 
for the table is used.



4 Conclusions and Future Work

Archiving is an approach to manage very large and fast growing databases. It 
is based on the observation that rarely used data can be moved to an archive 
to relieve the database of some load thus reducing problems of performance, 
database administration, and storage costs. In this paper we have discussed some 
aspects of an architecture for the integration of archives into database systems. 
A major challenge of database system integrated archiving is an appropriate way 
to reflect schema changes of databases and archives. We presented a special form 
of versioning for archive schemas and discussed this solution in some detail. 

A prototype called Archive Management System (AMS) is currently being 
developed to realize the concepts of integrated archiving presented in this paper. 
Further investigation is needed with respect to the semantics and integrity of 
archived data and some other aspects of integrated archiving. Finally, it may 
prove useful to examine archiving in the context of temporal databases. 



References 

1. C. De Castro, F. Grandi, and M. R. Scalas. Schema versioning for multitemporal 
relational databases. Information Systems, 22(5):249-290, July 1997. 

2. R. Cochrane, H. Pirahesh, and N. Mattos. Integrating triggers and declarative 
constraints in SQL database systems. In Proceedings of the 22nd Int. Conference 
on Very Large Data Bases (VLDB), pages 567-578, Mumbai (Bombay), India, 
September 1996. 

3. K. Dittrich, S. Gatziu, and A. Geppert, editors. The active database management 
system manifesto: A rulebase of ADBMS features. SIGMOD Record, 25(3):40-49, 
September 1996. 

4. A. Herbst and B. Malle. Electronic archiving in the light of product liability. In 
Proceedings of Intellectual Property Rights and New Technologies (KnowRight), 
pages 155-160, Vienna, Austria, August 1995. 

5. ISO/IEC 9075:1992. Information Technology — Database Languages — SQL, 1992. 

6. ISO/IEC 9075:1992/Cor.1:1996(E). Information Technology — Database
Languages — SQL — Technical Corrigendum 1, 1996.

7. J. Lufter. Archive SQL: Eine Spracherweiterung für die Archivierung in
Datenbanksystemen. Master’s thesis, Department of Mathematics and Computer
Science, University of Jena, Germany, April 1998. In German.

8. J. F. Roddick. Schema evolution in database systems — An annotated
bibliography. SIGMOD Record, 21(4):35-40, December 1992.

9. J. F. Roddick and R. T. Snodgrass. Schema versioning. In R. T. Snodgrass,
editor, The TSQL2 Temporal Query Language, chapter 22, pages 427-449. Kluwer
Academic Publishers, Boston, 1995. 

10. R. Schaarschmidt and J. Lufter. An architecture for archives in database systems.
Research report Math/Inf/98/18, Department of Mathematics and Computer
Science, University of Jena, Germany, June 1998.

11. R. T. Snodgrass, editor. The TSQL2 Temporal Query Language. Kluwer Academic 
Publishers, Boston, 1995. 

12. A. U. Tansel, J. Clifford, S. Gadia, S. Jajodia, A. Segev, and R. Snodgrass. Tempo- 
ral Databases: Theory, Design, and Implementation. Benjamin/Cummings, Red- 
wood City, CA, 1993. 




Modeling Cyclic Change 



Kathleen Hornsby¹, Max J. Egenhofer¹, and Patrick Hayes²

¹National Center for Geographic Information and Analysis
and
Department of Spatial Information Science and Engineering
University of Maine
Orono, ME 04469-5711, USA
{khornsby, max}@spatial.maine.edu
²Institute for Human and Machine Cognition, University of West Florida
Pensacola, FL 32514
phayes@ai.uwf.edu



Abstract. Database support of time-varying phenomena typically assumes that 
entities change in a linear fashion. Many phenomena, however, change 
cyclically over time. Examples include monsoons, tides, and travel to the 
workplace. In such cases, entities may appear and disappear on a regular basis 
or their attributes or location may change with periodic regularity. This paper 
introduces an approach for modeling cycles based on cyclic intervals. Intervals 
are an important abstraction of time, and the consideration of cyclic intervals 
reveals characteristics about these intervals that are unique from the linear case. 
This work examines binary cyclic relations, distinguishing sixteen cyclic 
interval relations. We identify their conceptual neighborhood graph, showing 
which relations are most similar and demonstrating that this set of sixteen
relations is complete. The results of this investigation provide the basis for 
extended data models and query languages that address cyclically varying 
phenomena. 



1 Introduction 

The development of conceptual models that convey how objects change over space 
and time demands continued attention from software engineers and database system 
designers. Theoretical advances in the design of data models for geographic 
information systems (GISs), for example, have focused on increasing support for
temporality [1-5] and spatial processes [6], including objects that experience identity
changes [7] and objects that move [8]. At the same time, there has been an increased
awareness of the necessity for a stronger cognitive element in software design [9].
Particular aspects of change, however, still remain beyond the scope of current data 
models. How do these models convey, for instance, spatio-temporal change associated 
with cycles of beach erosion and accretion due to tidal fluctuations and storms, or the 
planting cycles and crop rotations that are followed by farmers on a regular basis, or 
the cycles of monsoon rains in India? Queries based on any of these example 
scenarios must incorporate the cyclic nature of the phenomenon being studied. This 
paper develops a formal model for cycles based on cyclic intervals. Although we use 
examples of cycles drawn from geographic contexts of particular interest for spatio- 
temporal data modeling, this approach is also more generally useful for other 
applications that involve cyclic phenomena. 



1.1 Linear vs. Cyclic Phenomena 

Discussions within the database community on modeling time-varying phenomena 
have resulted in many models reflecting different views of the semantics associated 
with time [10]. Numerous approaches exist for modeling time, although time is most
often discussed with respect to two key structural models: linear and branching
models of time. The most general model of time in a temporal logic represents time as
an arbitrary, partially-ordered set [11, 12]. The addition of axioms results in more
refined models of time [11]. In the linear model, an axiom imposes total order on 
time, resulting in the linear advancement of time from the past, through the present, 
and to the future. The branching model, also known as the “possible futures” model, 
describes time as being linear from the past to the present, where it then divides into 
several time-lines, each representing a potential sequence of events. 

Few of these models, however, explicitly treat cycles. Although current 
information systems are useful for producing a snapshot of a phenomenon at any one 
time, cyclically-varying phenomena require new solutions. The measurement scales — 
nominal, ordinal, interval, and ratio — frequently applied to geographic phenomena 
have been shown to provide less than complete coverage, leaving out those 
measurements that are bounded within some range and repeat in a cyclic manner [13]. 
There are also cases of non-temporal cyclic change. Angles may at first seem to fit a 
ratio scale of measurement as there is a zero and the units are arbitrary (degrees, 
radians); however, an important characteristic of angles is that they repeat in a cyclic 
fashion [14]. Other examples of non-temporal cycles are color wheels and certain 
mathematical functions, such as the graphs of sine and cosine functions. The special 
nature of cycles has also been noted by cartographers exploring the role of 
cartographic animation as a technique for visualizing spatio-temporal change. 
Research on temporal legends that orient the user to a particular temporal framework 
[15] utilizes, for example, a time wheel designed to support querying of phenomena 
that exhibit cyclic variations. These efforts, however, are less common than the usual 
linear treatment of change. 



1.2 Temporal Intervals 

Temporal data models are commonly based on the primitive elements of either time 
points or time intervals. Time points typically describe a precise time when an event 
occurred. A linear model based on time points assumes a set of time points that are 
totally ordered [12]. When precise information on time is unavailable, time intervals 
become useful constructs. Reasoning about temporal intervals addresses the problem 
that much of our temporal knowledge is relative and methods are needed that allow 
for significant imprecision in reasoning [16]. This view does not require that all 
events occur in a known fixed order and it allows for disjunctive knowledge (e.g., 
event A occurred either before or after event B). Discussions about temporal points 
and intervals relate to conceptualizations of time that are discrete. Time, however, can 
be viewed as either discrete or continuous. The cycle of temperature change over the 
years, for instance, is a continuous phenomenon. The partitioning into seasons — 
winter, spring, summer, fall — forms discrete temporal concepts. Each season, 
modeled as an interval, forms a discrete temporal entity that becomes subject to cyclic 
reasoning. These discussions relate to those on spatial object and field models in the 
GIS domain [17, 18]. As people shift from a conceptualization based on continuous 
phenomena to discrete or vice versa as the task demands, they similarly switch from a 
view based on continuous time to time that is discrete. 

In spite of this duality existing for many common geographic phenomena, a 
discrete model of time typically underlies most temporal database models. The 
reasons for such common usage are [11]: measures of time are inherently imprecise 
where even instantaneous events can only at best be measured as having occurred 
during a chronon, the highest resolution time unit; the discrete model is compatible 
with most natural language references to time; and any implementation of a data 
model with a temporal dimension will of necessity have to have some discrete 
encoding for time. Temporal database models also impose axioms that treat the 
boundedness of time. A finite encoding implies bounding from the left (i.e., the 
existence of a time origin) and from the right. Cycles, however, require a different 
treatment. 



1.3 Structure of Paper 

This paper focuses on modeling cyclic change. Frank [19] gives examples of nine 
cyclic interval relations; however, we show through a formalization of cyclic intervals 
and the relations between these intervals that there are more than nine relations. These 
relations are fundamental to reasoning about scenarios involving cyclic change. The 
remainder of the paper is organized as follows: Section 2 reviews and discusses the 
nature of cycles. An approach to modeling cycles based on cyclic intervals is 
introduced in Section 3. The model formalizes binary relations between cyclic 
intervals, distinguishing sixteen cyclic interval relations and their conceptual 
neighbors. An example scenario based on reasoning with cyclic intervals is presented 
in Section 4. Conclusions and future work are discussed in Section 5. 



2 Cyclic Change 

The linear or branching models of time do not treat the fact that certain events or 
phenomena may be recurring. The term cycle is used to capture the notion of 
recurring events. Conceptually, we talk about life cycles, work cycles, cycles of 
poems or songs, and the seasonal cycle, which is perhaps the most common example 
of a cycle (Figure 1). 

Cycles may affect the existence of an object [7], the properties of an object, and the 
location of an object. In certain cases, a phenomenon, such as high tide, is existent for 
a period of time, becomes non-existent, and then reappears. This cycle is 
repeated over time. Similarly, at regular (or irregular) intervals, a water body, such as 
a pond or stream, may dry up and become non-existent before rains or high water 
levels bring it back into existence. 







Fig. 1. Seasonal activity cycle of 17th century Native peoples in Rupert’s Land, Canada (after 
[20]) 

Cycles can also be described from the perspective of cyclic changes to properties 
of an object. Examples of properties that vary cyclically include the size, shape, and 
value of an object. The population of a small college town in the US, for instance, can 
increase or decrease according to whether the University is in session (students are 
resident in town) or not. During the summer, when the University is not in full 
session, the student population is often much smaller and the town’s population is 
reduced. An understanding of the cyclic variation in population size is important in 
town planning, traffic planning, availability of accommodation, business decisions, 
etc. 

An object’s location can also vary in a cyclic pattern over time. People travel to 
their jobs each working day, for example, and then return to their homes in the 
evening; some people visit a grocery store on a regular basis, planning their excursion 
at approximately the same time every week; and trains, planes, and buses move in 
space according to schedules that are cyclic. 

A formal approach to modeling cycles based on cyclic intervals is introduced in the 
next section. 



3 Modeling Cyclic Intervals 

The embedding space for a cycle $C$ is a connected subset of the real numbers, $\mathbb{R}^1$. 
The period $n \in \mathbb{Z}$ describes the length of the full cycle $C_n$, such that $\mathbb{R}^1 \bmod n$ 
captures all points that are part of the full cycle. A cyclic interval $I$ is then a 
non-empty, connected, true¹ subset of $C_n$ (i.e., $I \neq \emptyset$ and $I \subset C_n$). If $I \subset C_n$, 
then $C_n \setminus I$ is $I$'s complement, denoted by $I^-$ (Figure 2). 

Given $I \subseteq C_n$, the interior of $I$, denoted by $I^\circ$, is defined to be the union of all 
open sets that are contained in $I$. In this paper, we assume that the interior of a 
cyclic interval is non-empty (i.e., $I^\circ \neq \emptyset$). The closure of $I$, denoted by $\bar{I}$, is the 
intersection of all closed sets that contain $I$. The boundary of $I$, denoted by $\partial I$, is 
the intersection of the closure of $I$ and the closure of the complement of $I$ (i.e., 
$\bar{I} \cap \overline{I^-}$). 




Fig. 2. Cyclic interval $A$ with start ($\partial_s A$), interior ($A^\circ$), end ($\partial_e A$), and complement ($A^-$) 

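The boundary of $I \subset C_n$ is disconnected, i.e., there are two distinct subsets of $\partial I$, 
called start ($\partial_s I$) and end ($\partial_e I$), satisfying the following three conditions: 

• $\partial_s I \neq \emptyset$ and $\partial_e I \neq \emptyset$ 

• $\partial_s I \cup \partial_e I = \partial I$ 

• $\partial_s I \cap \partial_e I = \emptyset$. 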

Based on these conditions, a cyclic interval is closed. It includes neither 
separations, nor a single point, nor an entire cycle. 
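
The definitions above can be made concrete with a small sketch. It is an illustration only, under an assumption the paper does not make: the cycle is approximated by the $n$ integer points $0, \ldots, n-1$ traversed clockwise. The class name `CyclicInterval` and its attributes are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CyclicInterval:
    """Closed interval on a discrete cycle with points 0..n-1 (clockwise).

    Assumption for illustration: the continuous cycle C_n is sampled
    at integer positions only.
    """
    start: int
    end: int
    n: int

    def __post_init__(self):
        if not (0 <= self.start < self.n and 0 <= self.end < self.n):
            raise ValueError("start and end must lie on the cycle")
        if self.start == self.end:
            raise ValueError("single points and full cycles are excluded")
        if not self.interior or not self.complement:
            raise ValueError("interior and complement must be non-empty")

    def _arc(self, a, b):
        """Points strictly between a and b, walking clockwise."""
        return frozenset((a + k) % self.n for k in range(1, (b - a) % self.n))

    @property
    def interior(self):
        return self._arc(self.start, self.end)

    @property
    def complement(self):
        return self._arc(self.end, self.start)

    @property
    def boundary(self):
        # Start and end are distinct, so the boundary has two disjoint parts.
        return frozenset({self.start, self.end})


# An interval that wraps past the origin of a 12-point cycle.
winter = CyclicInterval(start=10, end=2, n=12)
print(sorted(winter.interior), sorted(winter.complement))
```

In this sketch the boundary, the interior, and the complement partition the cycle, which mirrors the requirement that a cyclic interval is closed and is neither a single point nor the entire cycle.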

The order of the underlying $\mathbb{R}^1$ implies an orientation for the sequence of the two 
parts of the boundary and the interior such that $\partial_s I < I^\circ < \partial_e I$. The ordering of $C_n$, 
however, does not establish an order relation, because when applied to a cyclic space 
such as $C_n$, the order relation $\leq$ (“before or equal”) is not necessarily transitive (i.e., 
for elements $a, b, c \in C_n$, $a \leq b$ and $b \leq c$ does not necessarily imply that $a \leq c$). 
Therefore, no information about the relative order of $\partial_s I$ and $\partial_e I$ can be derived 
from $\partial_s I < I^\circ < \partial_e I$. 

Subsequently, we consider only cycles that have consistently the same orientation. 
We select a clockwise orientation, although the same results would apply for a 
consistent choice of a counterclockwise orientation. 



¹ We exclude here intervals that would extend through the entire cycle, since we are interested 
in modeling the prototypical cyclic relations. Our approach, however, is extendible to 
intervals that span the entire cycle. 
3.1 Binary Relations Between Cyclic Intervals 

Let A and B be a pair of cyclic intervals of the same cycle C„ (Figure 3a). 




Fig. 3. (a) Cyclic intervals $A$ and $B$ and (b) the intersection matrix based on whether the 
intersections of start, interior, and end are empty or non-empty; the matrix shown in (b) is 
$M = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}$ 



The relation between $A$ and $B$ is described by the corresponding values of the set 
intersections of the intervals’ boundaries and interiors. Since each cyclic interval has 
three distinct, mutually-exclusive parts ($\partial_s A$, $A^\circ$, $\partial_e A$, and $\partial_s B$, $B^\circ$, $\partial_e B$), there are 
a total of nine set intersections. They are concisely represented by a 3x3 matrix 
(Equation 1). 



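$$M = \begin{pmatrix}
\partial_s A \cap \partial_s B & \partial_s A \cap B^\circ & \partial_s A \cap \partial_e B \\
A^\circ \cap \partial_s B & A^\circ \cap B^\circ & A^\circ \cap \partial_e B \\
\partial_e A \cap \partial_s B & \partial_e A \cap B^\circ & \partial_e A \cap \partial_e B
\end{pmatrix} \qquad (1)$$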



Figure 3b shows an example of a cyclic interval relation and the corresponding 
matrix of empty (0) and non-empty (1) set intersections. 
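
As a minimal sketch of how such a matrix could be computed, the following code evaluates the nine set intersections for two intervals on a discretized cycle. The discretization is an assumption made for this example, not part of the paper's model: the cycle has `n` integer points, and interval endpoints are placed on even positions so that overlapping interiors always share at least one grid point. The function names are hypothetical.

```python
def arc(a, b, n):
    """Points strictly between a and b, walking clockwise on n points."""
    return {(a + k) % n for k in range(1, (b - a) % n)}


def parts(interval, n):
    """Split a (start, end) pair into start, interior, and end point sets."""
    start, end = interval
    return [{start}, arc(start, end, n), {end}]


def intersection_matrix(a, b, n):
    """Rows: start, interior, end of a. Columns: start, interior, end of b.

    An entry is 1 if the corresponding set intersection is non-empty.
    """
    return tuple(
        tuple(int(bool(pa & pb)) for pb in parts(b, n))
        for pa in parts(a, n)
    )


N = 24            # assumed grid resolution
A = (2, 8)        # hypothetical interval whose end coincides with B's start
B = (8, 14)
M = intersection_matrix(A, B, N)
for row in M:
    print(row)
```

For this pair the only non-empty intersection is between the end of `A` and the start of `B`, so the single 1 appears in the lower-left entry of the matrix.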

From among the $2^9 = 512$ possible combinations of empty and non-empty, a set of 
sixteen cyclic interval relations is realized (Figure 4). These relations are qualitative 
in nature as they do not capture any information, for example, about the cycle’s 
periods, the lengths of the intervals, or the amount of overlap. We discuss the 
completeness of this set in Section 3.3. 

The matrices capture valuable information about the comparison of the relations. 
First, matrices that are mirror images along the main diagonal identify symmetric 
relations. This holds true for relations disjoint, meets_twice, equals, and 
overlaps_twice. Second, pairs of matrices that are identical if one matrix is transposed 
along the main diagonal identify converse relations. Among the sixteen relations, 
there are six pairs of converse relations: meets and met_by, overlaps and 
overlapped_by, passes and passed_by, starts and started_by, finishes and finished_by, 
and contains and contained_by. 
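
The role of transposition can be illustrated with a short sketch. As before, the discrete cycle with endpoints on even grid positions is an assumption of this example, and all function names are hypothetical. Swapping the two intervals swaps rows and columns of the matrix, so the converse relation is obtained by transposing, and a relation is symmetric exactly when its matrix equals its own transpose.

```python
def arc(a, b, n):
    """Points strictly between a and b, walking clockwise on n points."""
    return {(a + k) % n for k in range(1, (b - a) % n)}


def parts(interval, n):
    start, end = interval
    return [{start}, arc(start, end, n), {end}]


def intersection_matrix(a, b, n):
    """1 where the intersection of a part of a and a part of b is non-empty."""
    return tuple(
        tuple(int(bool(pa & pb)) for pb in parts(b, n))
        for pa in parts(a, n)
    )


def transpose(m):
    """Mirror a matrix along its main diagonal."""
    return tuple(zip(*m))


def is_symmetric(m):
    return m == transpose(m)


N = 24
A, B = (2, 8), (8, 14)             # end of A coincides with start of B
m_ab = intersection_matrix(A, B, N)
m_ba = intersection_matrix(B, A, N)
print(m_ba == transpose(m_ab))     # converse relation is the transpose
print(is_symmetric(m_ab))          # this relation is not symmetric

# Two intervals that share both endpoints in opposite roles.
C, D = (2, 14), (14, 2)
print(is_symmetric(intersection_matrix(C, D, N)))
```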
Fig. 4. The sixteen cyclic interval relations with their corresponding intersection matrices. 
Orientation is clockwise 



3.2 Conceptual Neighborhoods 

The sixteen cyclic interval relations can be grouped according to their conceptual 
neighborhoods. Conceptual neighborhoods capture the similarities among the sixteen 
relations by linking those relations that are connected by an atomic change [21, 22]. 
Such a change is a single movement of one interval’s start or end point from the other 
interval’s boundary into its interior or exterior, or vice versa, moving the start or end 
point from the interior or exterior onto the boundary. Based on a computational model 
similar to that for topological line-region relations [23], the full set of all possible 
movements has been determined. It leads to a graph that corresponds to a lattice 
(Figure 5). 

This regular figure is an indication that no relations located in the interior were 
missed. Further examination of the borders along the top and the bottom of the 
neighborhood graph (Section 3.3) demonstrates that the set of sixteen cyclic interval 
relations is actually complete, provided $A$ and $B$ are true subsets of $\mathbb{R}^1$ and neither of 
their interiors is empty. 

The conceptual neighborhood graph exposes some interesting properties. 
Beginning with the case where two cyclic intervals are separate (disjoint), all diagonal 
rows of relations that run from the top left to the bottom right of a diagonal (e.g., from 
disjoint to contains) are formed by moving the end of the outer cyclic interval in a 
clockwise direction. Diagonal rows of relations that run the opposite way, from the 
top right to bottom left of a diagonal (e.g., from disjoint to overlapped_by), are 
formed by moving the start of the outer cyclic interval counterclockwise. Taken 
together, these relations form a double-diamond shape. Overlapped_by is drawn twice 
to demonstrate the regularity of the structure. Based on this grouping, each relation is 
the conceptual neighbor of at least two relations (cases disjoint, contained_by, 
overlaps_twice, and contains) and at most four other relations (cases overlapped_by, 
meets_twice, overlaps, and equals). 
Fig. 5. Planar projection of the conceptual neighborhood graph, with relation overlapped_by 
drawn twice to demonstrate the regularity 



3.3 Completeness of Set of Sixteen Cyclic Interval Relations 

The analysis of the conceptual neighborhood graph already illustrated the underlying 
regularities along the diagonals. We use this pattern to demonstrate that any relations 
located along the fringes of the graph require cases of intervals that would 
violate at least one of the properties of a cyclic interval. 

If one extends the diagonals beyond the border of the conceptual neighborhood 
graph, one provides 20 opportunities for additional relations (labeled A through T in 
Figure 6). If there were another cyclic interval relation, then it would have to be 
connected to the graph and would have to be located within the 20 slots. Four of these 
links point to existing cyclic relations (H to met_by, I to passed_by, R to started_by, 
and S to finishes); therefore, they can be discarded. From the regularity along the 
disjoint-contains diagonal (moving the end of the outer cyclic interval in a clockwise 
direction) it follows that K would be the relation in which a full cycle encompasses a 
cyclic interval (Figure 7a). Corresponding relations can be realized for the cases O, J, 
and N (Figures 7b-d). Along the same diagonals the cases A, E, D, and T can be 
evaluated with the reverse information (from bottom right to top left, moving the end 
of the outer cyclic interval in a counterclockwise direction). This sequence implies for 
the four slots that the outer interval must collapse to a single point either in the outside 
of the inner interval (slot A, Figure 7e), in the inside (slot E, Figure 7f), or on the two 
boundaries (slot D, Figure 7g; and slot T, Figure 7h). 

The corresponding analysis can be performed along the diagonals from top right to 
bottom left. It reveals that cases L, M, P, and Q would be occupied by relations with 
complete outer cycles, while cases B, C, F, and G would require the outer cycle to 
collapse to a single point. Since none of the twenty cases is occupied by a new 
interval relation, the set of sixteen (Figure 4) covers all possible cyclic interval 
relations. 
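
The count of sixteen can also be cross-checked by brute force. This is not the argument used in the paper. The sketch below enumerates all pairs of admissible intervals on a small discretized cycle and collects the distinct intersection matrices. The discretization (endpoints on even grid positions, interiors taken as the grid points strictly between them) and the function names are assumptions of this example.

```python
from itertools import product


def arc(a, b, n):
    """Points strictly between a and b, walking clockwise on n points."""
    return {(a + k) % n for k in range(1, (b - a) % n)}


def parts(interval, n):
    start, end = interval
    return [{start}, arc(start, end, n), {end}]


def intersection_matrix(a, b, n):
    return tuple(
        tuple(int(bool(pa & pb)) for pb in parts(b, n))
        for pa in parts(a, n)
    )


def realized_matrices(n):
    """Distinct matrices over all interval pairs with even endpoints."""
    evens = range(0, n, 2)
    found = set()
    for sa, ea, sb, eb in product(evens, repeat=4):
        if sa == ea or sb == eb:
            continue  # single points and full cycles are excluded
        found.add(intersection_matrix((sa, ea), (sb, eb), n))
    return found


for n in (8, 12, 16):
    print(n, len(realized_matrices(n)))
```

Refining the grid does not change the number of distinct matrices in this sketch, which is consistent with the relations being qualitative: they depend only on the cyclic arrangement of the four endpoints, not on interval lengths.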
Fig. 6. Set of cyclic interval relations plus cases where an interval has been collapsed to a point or 
extended to a full cycle. Orientation is clockwise 




Fig. 7. Additional cases of relations where the outer cyclic interval is extended to a full cycle 
with a start or end coinciding with (a) the outside of the inner interval, (b) the inside of the 
inner interval, (c) the start, and (d) the end, or the outer interval is collapsed to a single point (e) 
in the outside of the inner interval, (f) in the inside, (g) on the start boundary, and (h) on the end 
boundary 
4 An Example 

Cyclic relations are useful for reasoning about scenarios of change, for example, land 
use changes (Figure 8). Four different uses of land (timbering, fishing, hunting, and 
fruit gathering) vary cyclically over time with each cycle being one year in length. 
The orientation of the cycle of land use change is clockwise. The intervals 
representing the different land uses can be compared to one another (Figure 9). The 
interval for hunting, for instance, meets the interval for fishing at one end while 
overlapping at the other end. Both the ends of the interval for fishing overlap with the 
ends of the timbering interval. 
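
The comparisons described above can be reproduced with a short sketch. Several modeling choices here are assumptions made for illustration and are not stated in the paper: each month is mapped to the grid point at its start, "mid-April" to the middle of April, the year is divided into 48 grid points so that all endpoints fall on even positions, and the helper names are hypothetical.

```python
def arc(a, b, n):
    """Points strictly between a and b, walking clockwise on n points."""
    return {(a + k) % n for k in range(1, (b - a) % n)}


def parts(interval, n):
    start, end = interval
    return [{start}, arc(start, end, n), {end}]


def intersection_matrix(a, b, n):
    """Rows: start, interior, end of a. Columns: the same parts of b."""
    return tuple(
        tuple(int(bool(pa & pb)) for pb in parts(b, n))
        for pa in parts(a, n)
    )


N = 48  # assumed resolution: four grid points per month, January at 0


def month(m, mid=False):
    """Grid index for the start (or the middle) of month m (1 = January)."""
    return 4 * (m - 1) + (2 if mid else 0)


LAND_USE = {
    "timbering": (month(10), month(4, mid=True)),
    "fishing": (month(3), month(11)),
    "hunting": (month(10), month(3)),
    "fruit gathering": (month(7), month(8)),
}


def compare(x, y):
    return intersection_matrix(LAND_USE[x], LAND_USE[y], N)


for x in LAND_USE:
    for y in LAND_USE:
        if x != y:
            print(f"{x} vs {y}: {compare(x, y)}")
```

Under these assumptions, the matrix for hunting versus fishing has a non-empty intersection between the end of hunting and the start of fishing (the two meet in March) while the interiors also intersect (they overlap in the fall). For timbering versus fishing, every boundary of one interval lies in the interior of the other.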




[Figure legend: land use for timbering; fishing; land use for hunting; land use for fruit gathering] 

Fig. 8. Changes in land use: Land use includes timbering (October through mid-April), fishing 
(March through November), hunting (October through March), and fruit gathering (July 
through August). 




Fig. 9. Comparison matrix for cyclic land use change. 
5 Conclusions and Future Work 

Many phenomena, such as tides, beach erosion, and monsoons, change in a cyclic 
fashion. The semantics associated with cycles, however, have yet to be incorporated 
in conceptual data models. Current spatio-temporal data models are based on a linear 
model of time assuming total ordering and do not offer explicit support for cycles. If 
users know a priori that their data vary cyclically, this information needs to be 
captured in a database and needs support for queries on cyclic-based intervals, such 
that any cyclic variation is explicitly returned. 

A formalism of cycles based on cyclic intervals and the relations between these 
intervals distinguishes a set of sixteen cyclic interval relations, not including cases of 
full or empty cycles. This systematic derivation shows that there are more than the 
nine relations identified in Frank [19]. Analysis of the conceptual neighborhood graph 
demonstrates that these sixteen relations are complete, such that no relation exists 
between the nodes of the graph. 

Study of the complete set of cyclic relations including the cases of empty and full 
cycles is underway. This work will include analysis of the conceptual neighborhoods 
associated with the complete set of cyclic relations. To enable more comprehensive 
cyclic reasoning it is necessary to establish the composition of the sixteen cyclic 
relations (e.g., A meets_twice B and B contained_by C implies A overlaps_twice C). 
Based on a method used for determining the composition of topological relations in 
$\mathbb{R}^2$ [24], we will derive all 256 compositions for the cyclic relations. Of particular 
interest will be the crispness of these compositions as compared to the crispness of the 
compositions for linear intervals [18]. Further extensions to the model are also 
possible; for example, future work will include extending the model to accommodate 
cycles with different period lengths. 
Abstract. It is widely recognized that temporal aspects of database schemas 
are prevalent, but also difficult to capture using the ER model. The database 
research community’s response has been to develop temporally enhanced ER 
models. However, these models have not been subjected to systematic evaluation. In 
contrast, the evaluation of modeling methodologies for information systems 
development is a very active area of research in the information systems 
engineering community. Based on a framework from information systems engineering, 
this paper evaluates the ontological expressiveness of three different temporal 
enhancements to the ER model. The three temporal ER model extensions are 
well-documented, and together the models represent a substantial range of the design 
space for temporal ER extensions. 



1 Introduction 

Both the research community and the companies that design databases have recognized 
that temporal aspects of database schemas are both prominent and difficult to capture 
using the ER model. Intuitive and easy-to-comprehend diagrams become obscure and 
cluttered when the temporal aspects are modeled fully. As a result, the research community 
has developed a number of temporally enhanced ER models [13, 5, 18, 4, 2, 23, 22, 16, 
28, 8]. 

Both the standard and temporally enhanced ER models may be used for different, 
but related purposes, namely for analysis — i.e., for modeling a part of reality — and for 
design — i.e., for describing the database schema of a computer system. The typical use 
seems to be one where the model is used primarily for design and where the constructed 
diagrams are mapped to a relational platform. In step with the increasing diffusion and 
use of relational platforms in industry, ER modeling is growing in popularity. 

In the database research community, the models that are offered for conceptual 
database design are rarely evaluated systematically. In contrast, in the area of 
information systems engineering, the evaluation of modeling methodologies for information 
systems development is a very active area of research. A substantial number of 
evaluations are reported in the literature [1, 14, 20, 6, 11, 19, 24-27, 15], and the IFIP 
Working Group 8.1 is co-sponsoring an annual workshop, EMMSAD, devoted solely to this 
topic. 

* This work was supported in part by grants 9502695 and 9700780 from the Danish Technical 
Research Council and grant 9701406 from the Danish Natural Science Research Council, as 
well as grants from the Nykredit corporation and the Danish National Centre for IT Research. 
Weber and Wand have developed a framework for evaluating the ontological 
expressiveness of information systems development methodologies [24-27]. This framework 
includes an ontological model of the real world that covers both structural and 
behavioral aspects. The framework has been used to evaluate the notations and associated 
semantics of three models for information systems development. 

The present paper uses the approach of Weber and Wand for evaluating three 
different temporal extensions to the ER model. Specifically, the objective of this paper is 
to evaluate the ontological expressiveness of the temporal notational constructs of three 
selected temporal ER models [23, 28, 8]. Each of these temporal ER model extensions 
is well-documented, and together the models represent a substantial range of the 
design space for temporal extensions. The evaluation considers the use of the models for 
design only, since this is the typical use of the models. As a result, we evaluate how 
well the models capture the temporal aspects of a database design. This necessitates the 
introduction of a representational model for a database design. The three models were 
chosen based on their recency and quality. One was published in 1991, and the latter 
two were published within the last two years and may be considered second-generation 
models in that they attempt to build on the earlier models. 

This work is related to four surveys and comparisons of methodologies for the 
analysis and design of information systems. Brandt [1] surveys and evaluates thirteen 
methodologies for systems specification. Kung [14] studies three conceptual models 
with a time perspective. Floyd [6] has evaluated and compared three different systems 
development methodologies. Jayaratna [12] has developed a framework, termed 
NIMSAD, for understanding and evaluating methodologies. The evaluations above focus on 
modeling properties, the usage of the models, and the user-friendliness of the models. 
Only one evaluation considers expressiveness as a criterion [14], but it does not consider 
the models’ abilities to express temporal aspects; in contrast, the evaluation in this paper 
focuses entirely on expressiveness in relation to temporal aspects. 

Conceptual models for database design have also been evaluated and compared. 
Schrefl et al. [20] develop a set of criteria for comparing conceptual models and 
evaluate seven conceptual models. Hull and King [11] discuss issues of conceptual modeling 
and survey sixteen conceptual models. Peckham and Maryanski [19] describe generic 
properties of conceptual models and survey a representative selection of models. 
Leander et al. [15] compare the modeling capabilities of the ER model and the NIAM 
methodology. The evaluations of the conceptual models for database design all focus 
on non-temporal properties. This paper’s evaluation focuses exclusively on the 
modeling of temporal aspects. 

Ten temporally extended ER models have been surveyed and evaluated by the 
authors [7, 10]. The focus of these evaluations is entirely on model properties, and 
criteria based on ontologies are not considered. 

In summary, the focus of previous, related evaluations ranges from determining 
the environments in which methodologies were developed, through the usages of the 
methodologies, to the user-friendliness of the methodologies. Some studies examine 
the expressiveness of conceptual data models [14, 15]. However, previous work does 
not consider evaluation parameters that concern how well the models capture temporal 
aspects, which is the topic of this paper. We evaluate three different temporally extended 
ER models’ abilities to describe the temporal aspects of relational database schemas. 
A substantially extended version of this paper also considers the use of the three 
models for analysis [9]. 

The paper is structured as follows. Section 2 states the objectives of the paper’s 
evaluations, and Section 3 presents the evaluation framework. In Section 4, the three 
temporal ER extensions are evaluated. Finally, Section 5 summarizes the findings and 
outlines directions for future research. 

2 Evaluation Objectives 

This section first describes the context of the paper’s evaluation, by outlining several 
aspects of a conceptual model that may be subjected to evaluation. It then proceeds to 
state the objective of the paper’s evaluation, by positioning it within this context. 

2.1 Types of Evaluation 

Three different approaches to evaluating a conceptual model can be taken. First, the 
evaluation can be done by examining the diagrams that result from using the conceptual 
model, i.e., the diagrams will be the target of the evaluation. Second, the notation, 
i.e., the “building blocks” of the conceptual model, can be evaluated by examining, 
e.g., which notational constructs the model offers for modeling specific aspects of the 
underlying implementation. Third, the methods and guidelines that describe how to use 
the model during the design of a database can be evaluated. 

The difference between evaluating the resulting diagrams versus the notation of a 
temporal ER model may be explained as follows. A temporal ER model is a graphical 
model, that is, the notation of the model is graphical symbols, including rectangles, 
diamonds, and lines. Each symbol has a specific interpretation (semantics) that gives the 
meaning of the symbol. In contrast, a diagram is a connected collection of symbols. 
Evaluating a conceptual model by considering specific diagrams produced using the 
model versus evaluating it by considering each of its modeling constructs yields 
different evaluations. We proceed to discuss the three types of evaluations in more detail. 

One can evaluate a diagram with respect to analysis by observing how well the 
diagram describes the underlying database structure. This means that it should be 
possible for database designers to recognize the structure of the underlying database by 
examining the diagram. The notation of the model is irrelevant to the evaluation of a 
diagram, since the focus is entirely on how easy it is to recognize the database structure 
by examining the diagram. 

Next, the notation of a conceptual model can be evaluated with respect to design 
by examining whether or not the modeling constructs offered by the model can describe 
database structure with the desired accuracy. In order to do this, we have to examine 
which modeling constructs the underlying data model offers and then to determine 
which modeling constructs the conceptual model offers for describing these constructs 
of the underlying model. 

Finally, the methods and guidelines for how to use the model in the design phase 
can be evaluated with respect to how well they help and support the construction of the 
schema of the underlying database. 
2.2 Choice of Evaluation Objectives 

We have chosen not to base our evaluation on specific temporal ER diagrams, for 
several reasons. First, it is not clear who should create the diagrams. It is almost 
impossible to ensure that the corresponding diagrams for several different temporal ER models 
are created under similar conditions. Solving this problem by having the same 
persons create all diagrams introduces new problems: the sequence in which the diagrams 
are created is likely to matter, so that later diagrams are influenced by knowledge 
obtained during the creation of earlier diagrams. Second, the evaluation of how well an 
ER diagram documents the underlying database is quite subjective. Different database 
designers may have different answers to whether or not a diagram documents the 
underlying database. We have also chosen not to evaluate the methods and guidelines that 
the designers of the temporal ER models may have given, for the simple reason that no 
temporal ER model is equipped with such methods and guidelines. 

Rather, we have chosen to evaluate the notations of three temporal ER models. The 
evaluation occurs within a framework originally developed for evaluating the notations 
of methodologies for information systems development. Therefore, a new ontology to 
be used for evaluating the temporal ER model is provided. 



3 An Ontologically-Based Evaluation Framework 

This section presents the overall evaluation framework and then the ontology to be used 
when evaluating the temporal ER models. 



3.1 Overall Framework 

One part of Wand and Weber’s framework for evaluating the ontological 
expressiveness of the notations of methodologies for analysis and design of information systems 
(ISAD) [24-27] is a representational model of the real world. This model consists of all 
the real-world constructs, called ontological constructs, that an ISAD notation must be 
able to capture in order to model the real world. Since our evaluation focuses on 
temporally extended ER models’ capabilities in modeling temporal aspects of a database 
design, we develop our own representational model. 

The evaluation of whether or not a notation is ontologically expressive is based on 
the notion of mathematical mappings. The focus is on two sets: the set of ontological 
constructs and the set of notational constructs offered by the model to be evaluated. 
Two mappings exist between these, the representation mapping, which maps ontolog- 
ical constructs to corresponding notational constructs, and the interpretation mapping, 
which maps notational constructs to corresponding ontological constructs; see Fig. 1.

Although based on the precise mathematical notion of mappings between sets, the 
evaluation involves a certain degree of subjectiveness because the construction of the 
specific mappings necessitates some interpretation on the part of the evaluator. 

Informally, the notation of a model is ontologically complete if the notation can rep- 
resent the same information as the representational model (the ontological constructs); 
otherwise, it is ontologically incomplete and suffers from construct deficit [26]. That is,
a notation is ontologically complete if the representation mapping from the ontological
constructs to the notational constructs is total.

Fig. 1. Sets and Mappings in the Evaluation Framework [26]

Next, ontological clarity concerns the interpretation mapping from the notational 
constructs to the ontological constructs. Three situations can obscure the clarity of a 
notation. First, if one notational construct can be mapped into more than one onto- 
logical construct, this implies construct overload. Second, if more than one notational 
construct can be used to model the same ontological construct, the notation suffers from 
construct redundancy. Third, if there are notational constructs that do not represent any 
ontological constructs, the notation suffers from construct excess. 
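
The four notions above can be checked mechanically once the two mappings are written down. The following is a minimal sketch, assuming each mapping is encoded as a dictionary from a construct to a set of constructs; the function name `evaluate` and the toy notation are hypothetical and serve only as an illustration, not as a transcription of any table in this paper.

```python
def evaluate(ontological, representation, interpretation):
    """Classify a notation against a set of ontological constructs.

    representation: ontological construct -> set of notational constructs
    interpretation: notational construct -> set of ontological constructs
    """
    # Construct deficit: the representation mapping is not total.
    deficit = {o for o in ontological if not representation.get(o)}
    # Construct redundancy: several notational constructs model one ontological construct.
    redundancy = {o for o, ns in representation.items() if len(ns) > 1}
    # Construct overload: one notational construct maps to several ontological constructs.
    overload = {n for n, os_ in interpretation.items() if len(os_) > 1}
    # Construct excess: a notational construct maps to no ontological construct.
    excess = {n for n, os_ in interpretation.items() if not os_}
    return {
        "complete": not deficit,
        "deficit": deficit,
        "redundancy": redundancy,
        "overload": overload,
        "excess": excess,
    }


# Toy notation (hypothetical): one symbol for two temporal aspects,
# no support for a third, and one construct without ontological meaning.
ontological = {"lifespan", "valid time", "transaction time"}
representation = {
    "lifespan": {"clock"},
    "valid time": {"clock"},
    "transaction time": set(),
}
interpretation = {
    "clock": {"lifespan", "valid time"},
    "h(max)": set(),
}
result = evaluate(ontological, representation, interpretation)
```

In this toy case the notation is incomplete (deficit for transaction time), overloaded (the clock symbol), and has one excess construct. As noted above, writing down the mappings still requires interpretation by the evaluator, so the sketch only automates the final classification step.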

3.2 Ontology 

When a conceptual model is used for database design, the predominant target model is 
the relational model, and we assume that the target model is this model or a temporal 
extension of it. 

A database stores information about objects from a modeled reality, which consists 
of objects. The time during which an object exists in the reality, we call the existence 
time of the object. An object is characterized by its properties. The time during which 
it is true (in the modeled reality) that a property has a specific value is called the valid 
time of that particular value. 

A relation may capture the existence or valid time of its tuples by using time at- 
tributes defined over an appropriate time domain. The time during which a tuple is 
current in the database is called the transaction time of the tuple, and a relation may 
also include time attributes that capture this aspect. Next, user-defined constraints can
be defined over the relations, e.g., how many tuples in one relation are allowed to have
references to the same tuple in another relation. Constraints that must hold for all
points in time we call temporal constraints, while constraints that must hold
at each point in time in isolation we call snapshot constraints.
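
The difference between the two kinds of constraints can be made concrete on a small relation with valid-time attributes. The sketch below is one possible reading, under the assumptions that valid time is a half-open interval of integers and that the constraint is a maximum cardinality; the relation `works_for` and both function names are hypothetical.

```python
from collections import defaultdict

# Each tuple: (employee, department, valid_from, valid_to),
# with valid time given as a half-open integer interval [valid_from, valid_to).
works_for = [
    ("ann", "sales", 1, 5),
    ("ann", "research", 5, 9),
    ("bob", "sales", 2, 8),
    ("bob", "research", 6, 10),
]


def snapshot_max_ok(rows, max_per_instant):
    """Snapshot constraint: at each instant, taken in isolation,
    an employee works for at most max_per_instant departments."""
    instants = {t for _, _, start, end in rows for t in range(start, end)}
    for t in instants:
        count = defaultdict(int)
        for emp, _, start, end in rows:
            if start <= t < end:
                count[emp] += 1
        if any(c > max_per_instant for c in count.values()):
            return False
    return True


def temporal_max_ok(rows, max_over_time):
    """Temporal constraint: over all instants taken together,
    an employee works for at most max_over_time distinct departments."""
    departments = defaultdict(set)
    for emp, dept, _, _ in rows:
        departments[emp].add(dept)
    return all(len(d) <= max_over_time for d in departments.values())
```

Here `snapshot_max_ok(works_for, 1)` fails because the two tuples for bob overlap in time, whereas `temporal_max_ok(works_for, 2)` holds because no employee is ever associated with more than two departments.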

4 Model Evaluation 

This section examines the ontological completeness and clarity of three temporally
extended ER models. The notations of the three models are presented by diagrams
modeling a company database.

4.1 The ERT Model



The first model to be examined is the Entity-Relationship-Time (ERT) model [23].




Fig. 2. An ERT Diagram Describing a Company Database 

The model supports lifespans of entities and the valid time of relationships. Fig. 2 is 
an ERT diagram modeling a company database. A rectangle represents an entity class or 
a value class (black triangle placed in the bottom-right corner). Entity and value classes 
can be specified as complex by using a double rectangle. An entity class expanded with 
a “timebox” containing the symbol T is specified as temporal. Relationship classes are 
denoted by small filled rectangles, and only binary relationships are available. These 
can be specified as temporal by expanding the filled rectangle with a timebox. For each 
entity (or value) class participating in a relationship class, an involvement role and a 
cardinality constraint must be specified. An ISA relationship class is denoted by a circle 
with arrow(s) flowing from the subclass(es) to the circle and an arrow flowing from the 
circle to the superclass. 

The result of the examination of ERT with respect to ontological completeness is 
presented in Table 1. It can be seen that ERT suffers from three, possibly four, cases of 
construct deficit. First, the model has no notation for specifying time domains. Second, 
the model does not support transaction time. Third, no notation is offered for specifying 
temporal constraints. Fourth, since the semantics of the construct offered for specifying 
cardinality constraints is unclear, it cannot be determined with certainty if the cardinality
constraint is a snapshot constraint.

The result of the examination of ERT with respect to ontological clarity is presented 
in Table 2. The model does not suffer from construct redundancy. It is unclear if ERT 
suffers from construct excess due to the unclear semantics of the constructs offered 
for specifying constraints. The model suffers from one case of construct overload: the
timebox models both lifespan for entity types and valid time for attributes.

Table 1. Evaluating ERT With Respect to Ontological Completeness

Ontological Construct                ERT Representation

Time domains                         Not represented.
Lifespan attributes                  Represented by the timebox extension of entity classes.
Valid-time attributes                Represented by the timebox extension of user-defined
                                     relationship classes.
Transaction-time attributes          Not represented.
Temporal user-defined constraints    Not represented.
Snapshot user-defined constraints    Represented by placing the min-max constraint near the entity
                                     type (or value class) participating in the relationship class.



Table 2. Examining ERT With Respect to Ontological Clarity

ERT Construct                                  Ontological Construct

Timebox                                        Models lifespan attributes and valid-time attributes.
Cardinality constraint                         Might model a snapshot user-defined cardinality constraint.
Superclass/subclass completeness constraint    Might model a snapshot user-defined generalization
                                               completeness constraint.
Superclass/subclass disjointness constraint    Might model a snapshot user-defined constraint.



4.2 The TERC+ Model

This section evaluates the TERC+ model [28], which supports lifespans of entity types
and valid time of relationship and attribute types. Fig. 3 presents a TERC+ diagram
modeling a company database.

Fig. 3. A TERC+ Diagram Describing a Company Database

An entity type is represented by a rectangle, and a relationship type is denoted by 
a rectangle with rounded corners. Entity types and relationship types can be annotated 
with a clock symbol to indicate that their life cycles are to be captured. Attributes are 
represented by plain text linked to entity or relationship types and can be annotated 
with a clock symbol to indicate that valid time is to be captured. Cardinality constraints 
are expressed using the lines connecting the entity types to the relationship types. A 
historical cardinality constraint, h(max), can be expressed for relationship types. A
part_of/component_of relationship type is denoted as a relationship type annotated with
a filled diamond and an arrow pointing to the component. 

Table 3 characterizes the ontological completeness of TERC+. The model suffers 
from three cases of construct deficit. First, there is no notation available for specifying 
time domains. Second, there is no notation offered for specifying the transaction time 
of attributes. Third, temporal user-defined constraints cannot be specified. 

Table 3. Evaluating TERC+ With Respect to Ontological Completeness

Ontological Construct                TERC+ Representation

Time domains                         Not represented.
Lifespan attributes                  Represented by entity types annotated with a clock symbol.
Valid-time attributes                Represented by attributes and relationship types annotated
                                     with a clock symbol.
Transaction-time attributes          Not represented.
Temporal user-defined constraints    Not represented.
Snapshot user-defined constraints    Represented by the constraint (min, max) that has to be
                                     specified for each entity type participating in a
                                     relationship type.



The result of examining the ontological clarity of TERC+ is presented in Table 4.
The model does not suffer from construct redundancy, but it does suffer from construct
overload, since the clock symbol models both valid-time and lifespan attributes. Construct excess
occurs because the historical cardinality constraint cannot be mapped to any ontological 
construct. 



Table 4. Examining TERC+ With Respect to Ontological Clarity

TERC+ Construct                        Ontological Construct

Clock symbol                           Models valid-time or lifespan attributes.
Cardinality constraint                 Models a snapshot user-defined cardinality constraint.
Historical cardinality constraint      No corresponding ontological construct.
Total generalization constraint        Models a snapshot user-defined generalization
                                       completeness constraint.
Exclusive generalization constraint    Models a snapshot user-defined generalization
                                       disjointness constraint.



4.3 The TimeER Model

The last model to be evaluated is the Time Extended ER (TimeER) model [8]. This 
model offers support for lifespans of entity types; valid time for attributes and rela- 
tionship types; and transaction time for entity types, relationship types, and attributes. 
Figure 4 presents a TimeER diagram modeling a company database. The model ex- 
tends the notation of the ER model with annotations to indicate which temporal aspects 
are to be captured. The annotations are LS, indicating lifespan support; VT, indicating
valid-time support; TT, indicating transaction-time support; LT, indicating lifespan and
transaction-time support; and BT, indicating valid- and transaction-time support.

Table 5 shows that the TimeER model is ontologically complete with respect to the 
relational ontology. 

Considering the ontological clarity of the TimeER model with respect to the
relational ontology, it follows from Table 6 that the model does not suffer from construct
redundancy. However, it suffers from one case of construct overload since the
superclass/subclass completeness constraint models both temporal and snapshot user-defined
generalization completeness constraints. The model does not suffer from construct
excess.

Table 5. Evaluating TimeER With Respect to Ontological Completeness

Ontological Construct                TimeER Representation

Time domains                         The domain of a time attribute is indicated by the annotation
                                     used. If the annotation is an LS then the time domain is the
                                     lifespan time domain.
Lifespan attributes                  Entity types annotated with an LS (LT) model the presence of
                                     lifespan attributes.
Valid-time attributes                The annotations VT and BT indicate the presence of valid-time
                                     attributes.
Transaction-time attributes          The annotations TT, LT, and BT indicate the presence of
                                     transaction-time attributes.
Temporal user-defined constraints    These are represented by the lifespan participation constraint
                                     [min, max] for relationship types.
Snapshot user-defined constraints    These are represented by the snapshot participation constraint
                                     (min, max) for relationship types.
                                     Represented by the superclass/subclass completeness constraint.
                                     Represented by the superclass/subclass disjointness constraint.

5 Summary and Research Directions 



At the outset, the paper presents an overall framework for examining the ontological 
expressiveness and clarity of temporal ER models. The framework includes an ontol- 
ogy which is used as a so-called representational model. This ontology describes the 
constructs of a relational database schema, which serve a role in capturing the temporal 
aspects of data.

Table 6. Examining TimeER With Respect to Ontological Clarity

TimeER Construct                                Ontological Construct

LS                                              Models lifespan attributes.
VT                                              Models valid-time attributes.
TT                                              Models transaction-time attributes.
LT                                              Models lifespan and transaction-time attributes.
BT                                              Models valid-time and transaction-time attributes.
Snapshot participation constraints              Model snapshot user-defined constraints.
Lifespan participation constraints              Model temporal user-defined constraints.
Superclass/subclass completeness constraints    Model temporal and snapshot user-defined
                                                generalization completeness constraints.
Superclass/subclass disjointness constraints    Model snapshot user-defined generalization
                                                disjointness constraints.



The framework concerns the mappings between the ontological constructs and those 
of temporal ER models. The framework is used for evaluating three temporal extensions 
of the ER model, namely the ERT, the TERC+, and the TimeER model, with respect 
to their use for design of relational databases managing time-varying data.

All three models offer support for capturing lifespan and valid-time attributes and 
snapshot constraints (although the semantics of ERT’s constraints are unclear). Only 
TimeER is able to model transaction-time attributes and temporal constraints. The 
overall result is that no model is ontologically expressive with respect to the ontology. In 
addition, the ERT and TERC+ models are ontologically incomplete and ontologically 
unclear. The TimeER model is ontologically complete, but also ontologically unclear.
The model supports all the three temporal aspects considered. As a result, TimeER
makes it possible to model the temporal aspects, as covered by the ontology, of a
relational database schema. With the ERT and TERC+ models, it is not possible to model
all the temporal aspects of a relational database schema. This leads to the conclusion that
these two models do not fully support the conceptual design of databases managing 
time-varying data, since transaction-time support is frequently necessary in applica- 
tions. 
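
The completeness findings summarized above can be restated compactly. The following sketch encodes the "represented / not represented" entries of Tables 1, 3, and 5 as booleans and derives the construct deficits per model; the names `CONSTRUCTS`, `REPRESENTED`, `deficits`, and `is_complete` are hypothetical, and the snapshot entry for ERT is recorded as represented even though its semantics are noted as unclear.

```python
CONSTRUCTS = [
    "time domains",
    "lifespan attributes",
    "valid-time attributes",
    "transaction-time attributes",
    "temporal user-defined constraints",
    "snapshot user-defined constraints",
]

# True means the completeness table lists a representation for the construct.
REPRESENTED = {
    "ERT":    [False, True, True, False, False, True],
    "TERC+":  [False, True, True, False, False, True],
    "TimeER": [True, True, True, True, True, True],
}


def deficits(model):
    """Return the ontological constructs the model cannot represent."""
    return [c for c, ok in zip(CONSTRUCTS, REPRESENTED[model]) if not ok]


def is_complete(model):
    """A model is ontologically complete if it has no construct deficit."""
    return not deficits(model)
```

With this encoding, `deficits("ERT")` and `deficits("TERC+")` both list time domains, transaction-time attributes, and temporal user-defined constraints, while `is_complete("TimeER")` holds.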

The research reported in this paper points to several directions for future
research.

It is recommended that the ERT and TERC+ models be enhanced with support for 
transaction time. Next, it appears relevant to consider extending the ERT and TimeER 
models with support for modeling dynamic aspects of reality; TERC+ already offers
some such support. In addition, new notation supporting the capture of database be- 
havior might be introduced. Indeed, the idea of being able to capture in the conceptual 
model the evolution of a database schema appears to be very appealing. That is, it should 
be studied how to extend these and other conceptual models with notational constructs 
that conveniently capture the evolution of the database schema over time. 

As another topic, the evaluation framework itself might be enhanced. The outcome 
of an evaluation is quite sensitive to the ontology employed, making it an interesting 
direction to expand the ontology to capture better the structural aspects not related to 
time, and perhaps also to capture dynamic aspects. It might also be of interest to develop
a consensus ontology or to have an ontology developed independently by users of
conceptual models. 

Also, the mappings between notational constructs and ontological constructs could 
be generalized to include a degree of satisfaction rather than always assuming 100% 
satisfaction. The historical participation constraint of TERC+ — which limits the maxi- 
mum number of relations, but not the minimum number — offers an example of a partial 
satisfaction of a temporal constraint. Evaluations based on these more sophisticated 
mappings might yield a richer picture of the models. On the other hand, the assignment 
of degrees may prove very subjective. 

The type of evaluation conducted indicates that the models evaluated are very sim- 
ilar, in that if only a few constructs are added to each of the three models, they would 
be isomorphic to each other. We feel that this apparent similarity is to some extent an 
artifact of the evaluation framework, which is unable to, e.g., discern significant lin- 
guistic differences. To obtain a more complete understanding of the models and their 
similarities and differences, more evaluations based on other criteria are recommended. 

More generally, it is felt that there is a need for more methods that systematically 
evaluate and compare extended ER models. Such methods are likely to prove useful to 
both the designers of new conceptual models and the users of such models. 



Acknowledgments 

The authors would like to thank the anonymous reviewers for their particularly insight- 
ful comments. 



References 

1. I. Brandt. A Comparative Study of Information Systems Design Methodologies. In 
T. W. Olle, H. G. Sol, and C. J. Tully, editors, Information Systems Design Methodologies: 
A Feature Analysis, pp. 9-36. Elsevier Science Publishers, 1983. 

2. R. Elmasri, I. El-Assal, and V. Kouramajian. Semantics of Temporal Data in an Extended ER 
Model. In 9th International Conference on the Entity-Relationship Approach, pp. 239-254,
1990. 

3. R. Elmasri and S. B. Navathe. Fundamentals of Database Systems. Benjamin/Cummings, 2nd
edition, 1994.

4. R. Elmasri, G. Wuu, and V. Kouramajian. A Temporal Model and Query Language for EER
Databases. In A. Tansel et al., editors, Temporal Databases: Theory, Design, and
Implementation, Chapter 9, pp. 212-229. Benjamin/Cummings Publishers, Database Systems and
Applications Series, 1993.

5. S. Ferg. Modeling the Time Dimension in an Entity-Relationship Diagram. In 4th Interna- 
tional Conference on the Entity-Relationship Approach, pp. 280-286, 1985. 

6. C. Floyd. A Comparative Evaluation of Systems Development Methods. In T. W. Olle, 
H. G. Sol, and A. A. Verrijn-Stuart, editors, Information Systems Design Methodologies:
Improving the Practice, pp. 19-54. Elsevier Science Publishers, 1986. 

7. H. Gregersen and C. S. Jensen. Temporal Entity-Relationship Models — a Survey. IEEE 
Transactions on Knowledge and Data Engineering, 11(3):464-497, 1999.

8. H. Gregersen and C. S. Jensen. Conceptual Modeling of Time-Varying Information.
Technical Report TR-35, TimeCenter, 1998.







9. H. Gregersen. Temporally Enhanced Database Design. Ph.D. Thesis, Department of Com- 
puter Science, Aalborg University, 1999. 

10. H. Gregersen, C. S. Jensen, and L. Mark. Evaluating Temporally Extended ER Models.
In K. Siau, Y. Wand, and J. Parsons, editors, Proceedings of the Second CAiSE/IFIP 8.1
International Workshop on Evaluation of Modeling Methods in Systems Analysis and Design,
12 pages, 1997.

11. R. Hull and R. King. Semantic Database Modeling: Survey, Applications, and Research
Issues. ACM Computing Surveys, 19(3):201-260, 1987.

12. N. Jayaratna. Understanding and Evaluating Methodologies: NIMSAD - A Systematic
Framework. Information Systems, Management and Strategy Series. McGraw-Hill, 1994.

13. M. R. Klopprogge and P. C. Lockeman. Modeling Information Preserving Databases:
Consequences of the Concept of Time. In 9th International Conference on Very Large Data
Bases, pp. 399-416, 1983.

14. C. H. Kung. An Analysis of Three Conceptual Models with Time Perspective. In T. W. Olle,
H. G. Sol, and C. J. Tully, editors, Information Systems Design Methodologies: A Feature
Analysis, pp. 141-167. Elsevier Science Publishers, 1983.

15. A. H. F. Laender and D. J. Flynn. A Semantic Comparison of the Modeling Capabilities 
of the ER and NIAM Models. In R. Elmasri, V. Kouramajian, and B. Thalheim, editors, 
Entity-Relationship Approach — ER '93, Volume 823 of Lecture Notes in Computer Science, 
pp. 242-256. Springer Verlag, 1993. 

16. V. S. Lai, J.-P. Kuilboer, and J. L. Guynes. Temporal Databases: Model Design and
Commercialization Prospects. DATA BASE, 25(3):6-18, 1994.

17. P. McBrien, A. H. Seltveit, and B. Wangler. An Entity-Relationship Model Extended to 
Describe Historical Information. In International Conference on Information Systems and 
Management of Data, pp. 244-260, 1992. 

18. A. Narasimhalu. A Data Model for Object-Oriented Databases with Temporal Attributes and 
Relationships. Technical report. National University of Singapore, 1988. 

19. J. Peckham and F. Maryanski. Semantic Data Models. ACM Computing Surveys, 20(3):153-189,
1988.

20. M. Schrefl, A. M. Tjoa, and R. R. Wagner. Comparison-Criteria for Semantic Data Models.
In J. K. Aggarwal, editor, First International Conference on Data Engineering, pp. 120-125,
1984.

21. R. T. Snodgrass. The Temporal Query Language TQuel. ACM Transactions on Database
Systems, 12(2):247-298, 1987.

22. B. Tauzovich. Toward Temporal Extensions to the Entity-Relationship Model. In 10th
International Conference on the Entity Relationship Approach, pp. 163-179, 1991.

23. C. I. Theodoulidis, P. Loucopoulos, and B. Wangler. A Conceptual Modelling Formalism for
Temporal Database Applications. Information Systems, 16(4):401-416, 1991.

24. Y. Wand and R. Weber. An Ontological Evaluation of Systems Analysis and Design Methods.
In E. D. Falkenberg and P. Lindgreen, editors, Information Systems Concepts: An In-depth
Analysis, pp. 79-107, 1989.

25. Y. Wand and R. Weber. Toward a Theory of The Deep Structure of Information Systems. In
J. I. DeGross, M. Alavi, and H. Oppelland, editors, International Conference on Information
Systems, pp. 61-71, 1990.

26. Y. Wand and R. Weber. On the Ontological Expressiveness of Information Systems Analysis
and Design Grammars. Journal of Information Systems, 3(4):217-237, 1993.

27. R. Weber and Y. Zhang. An Analytical Evaluation of NIAM’s Grammar for Conceptual
Schema Diagrams. Information Systems Journal, 6(2):147-170, 1996.

28. E. Zimanyi, C. Parent, S. Spaccapietra, and A. Pirotte. TERC+: A Temporal Conceptual
Model. In International Symposium on Digital Media Information Base, 1997.




Semantic Change Patterns in the Conceptual Schema 



L. Wedemeijer

ABP Netherlands, Department of Information Management 
e-mail: L.Wedemeijer@ABP.nl 



Abstract. Conceptual schemas are described using some preferred data model 
theory. A taxonomy of potential changes can be derived from that data model
theory, but it has no great significance for the maintenance and evolution of a
given schema because it is based only on the constructs of the data model
theory. This current state of the art in conceptual modelling can be improved by
learning from actual business cases. This paper describes a number of semantic
change patterns observed in operational conceptual schemas. Several semantic 
changes are not accounted for in current taxonomies. It shows that studying 
evolution in operational schemas opens up new and promising ways of 
improving data model theories and current design practices. 



1. Introduction 

Enterprises spend much effort in the design of a high-quality Conceptual Schema 
(CS) as an abstract model of the relevant Universe of Discourse (UoD) (figure 1). 
Less attention is paid to maintenance once the schemas have been put into operation. 
It has long been recognized that there is no hope in 'picking the right way to organize 
the stored data, and it never has to change. Change is inevitable' [Tsichritzis77]. 




Fig 1 CS as abstract model of the UoD

Still, many contemporary theories and design practices for conceptual data modelling
concentrate on a single design effort to deliver a conceptual schema once and for all.
Indeed, many quality aspects of a given CS [Batini, Ceri, Navathe’92] can be verified
at design time, but not its stability or flexibility [Kesh'95], [Levitin, Redman'95].
Even though many data model theories and design strategies claim flexibility of their 
resulting CS, the potential for change in the CS and for absorbing the impact of 
change is not often demonstrated in operational CSs [Wedemeijer'99]. The literature 
is largely silent on changes that are observed in operational CSs. This paper tries to 
remedy this lack by providing a few examples, not aiming for completeness. It intends 
to abstract the know-how experience from maintenance, and to make that knowledge 
available to data administrators and researchers. The semantic change patterns have 
been abstracted from actual changes in operational conceptual schemas. To our 
knowledge, this is the first attempt to systematically study changes that occur in
operational schemas and to learn from them. 

Semantic change patterns differ from design patterns [Gamma'94] in several ways. 
Current design patterns are aimed at software- and object-oriented technology. More 
important, design patterns propose a design solution for a given problem which is 
static, while change patterns have their value in enabling the graceful evolution of 
data schemas, in accordance with a changing UoD. 

Semantic change patterns are a special kind of schema transformations [De 
Troyer'93]. Forward-engineering transformations and best-practices aim to create an 
efficient database schema once the CS has been established, without changing the 
semantics of the CS. Reverse-engineering transformations reconstruct a CS from a 
given legacy implementation schema [Hainault, Chandelon, Tonneau, Joris'93]. 
Semantic change patterns differ from these kinds of lossless transformations because 
change patterns do adjust the semantics of the CS (and database content). 
Schema-evolution as discussed in [McKenzie, Snodgrass'90], [Proper'97], and 
taxonomies [Batini, Di Battista'88], [Ewald, Orlowska'93] are primarily concerned 
with theoretical foundations to enable change in the CS. These approaches study 
potential changes in the constructs and constructions that are postulated in the data 
model theory. Knowledge of actual changes in operational CSs and the impact on data 
populations is not systematically incorporated in these approaches. 

Ontological approaches as proposed in [Wand, Monarchi, Parsons, Woo’95] are 
related to our work as ontologies can help in determining a level of abstraction for the 
CS that will not change. While this can be an excellent way to create a high-quality 
CS, no approach can be expected to deliver schemas that are free of change. 

There is some literature on the relation between organizational need for information, 
and the systems that deliver it [Keen'81], [Goodhue, Wybo, Kirch92], [Barua,
Ravindran96]. These sources refer to change drivers such as organizational change 
(mergers, diversifications etc), Business Process Reengineering projects that increase 
competitive edge, and technological innovations. How these change drivers are 
related to specific changes taking place in the CS of the operational information 
systems is not well understood.

2. Change in the CS



We define a semantic change to be any change in the operational CS that validly
reflects change(s) in the information structure of the UoD. A CS change is not
semantic if nothing in the structure of the UoD has changed that might explain it. This
definition is not very precise, but it does allow us to decide which CS changes are
semantic and which are not by looking for the corresponding change drivers in the
UoD. While the definition assumes a causal relationship between changes in UoD and 
CS, it does allow for delay (changes need not coincide) and for a single UoD change 
to alter many aspects of the CS. The change patterns that we will describe are 
abstractions of changes that satisfy the above definition. The resulting patterns are 
aimed at adjusting the CS semantics (and database content) to the changing 
requirements and information needs in the business environment. 

To detect change, the CS ‘before’ and ‘after’ must be known. There are many
different data model theories that a designer can choose from to document his or her
CS. A rich theory [Saiedian97] allows UoD structures to be represented in several
ways. This is for instance the case in UML or EER data model theories. A switch
between such equivalent representations need not reflect a semantic change in the
UoD. We will therefore use an essential, as opposed to rich, variant of the Entity-
Relationship data model theory. It will not allow many-to-many relationships as these
can equally well be represented using ‘intersection’ entities. Further details on the
data model theory (and its taxonomy) must be omitted for the sake of brevity; it
suffices to say that it has been applied in a consistent way in all business cases.
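
The replacement of a many-to-many relationship by an ‘intersection’ entity can be illustrated as follows. This is a minimal sketch, assuming a relationship is given as a list of identifier pairs; the function `to_intersection_entity` and the student/course example are hypothetical and are not taken from the business cases.

```python
def to_intersection_entity(name, left, right, pairs):
    """Rewrite a many-to-many relationship as an intersection entity.

    Each row of the new entity holds one foreign key to each side, so the
    entity takes part in two many-to-one relationships instead of one
    many-to-many relationship.
    """
    left_fk = f"{left}_id"
    right_fk = f"{right}_id"
    rows = [{left_fk: a, right_fk: b} for a, b in sorted(set(pairs))]
    return {"entity": name, "references": [left, right], "rows": rows}


# Hypothetical example: students enrol in courses (many-to-many).
enrolment = to_intersection_entity(
    "Enrolment", "student", "course",
    [("s1", "c1"), ("s1", "c2"), ("s2", "c1")],
)
```

Both representations carry the same information, which is why a switch between them is not counted as a semantic change.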

Changing an operational CS is no small matter. The consequences for the stored data 
must be carefully considered. Existing business procedures, user interfaces, 
applications etc all have to be reviewed to determine the full impact. Because 
enterprises try to keep the impact of change as small as possible, the adapted CS is 
required to be a good model of the new UoD, and at the same time to be 'as close to 
the old CS as possible'. This usually translates into the demand for compatibility, thus 
reducing the need for complex data conversions and application reprogramming. 
When a schema is changed, data administrators and designers must be on guard for 
undocumented features. It is a known problem in legacy databases that data is 
sometimes stored in unexpected ways using the available constructs and 
constructions. This causes the formally documented CS to misrepresent the actual 
semantics of the data. Database reverse-engineering techniques have been used to 
derive a correct CS from the operational database content and structure. 

Considering these practical problems, it is evident that change in the CS can't be 
studied in isolation, i.e. from the taxonomy only. The behaviour of operational 
schemas must be studied in their natural business environment. It must be stressed 
that change in the schema does not imply that 'something went wrong’ in design. 
Every schema, no matter how well-designed and no matter which data model theory 
was used, can and will be subject to change. While a quality design might lower the 
number and the impact of changes, it cannot prevent them altogether. As prevention 
of change is not the subject of this paper, we bypass the problem of determining
‘static’ quality of a CS. Instead, we focus on semantic change in the CS.

3. Semantic change patterns

This section gives an informal description of 7 semantic change patterns that have
been observed in CSs of operational information systems. The paper is restricted to 
semantic changes in entities and relationships; the -much more common- changes in 
the attribute- and constraint-constructs are not considered. 

3.1 Entity insertion 

This change (figure 2) demonstrates extension of the UoD, and extensibility of the CS 
[Atkinson, DeWitt, Maier, Bancilhon, Dittrich, Zdonik'90]. It is commonly listed in
taxonomies, often being divided in two seemingly unrelated steps: inserting the entity, 
and inserting its relationships. We find that the semantic change in operational 
schemas is always accomplished in a single maintenance effort. 



Purpose: To capture information on newly relevant UoD objects. 

Change (in the CS): A new entity is introduced, and relationships with existing 
entities are set up (also, referential-integrity constraints are introduced). We 
noticed that most insertions concern entities that have only a single relationship 
in the CS. Insertion of a new entity with several relationships is rare. 

Change (in the data population): Ordinarily, no change in existing data is needed. 
But when the new entity features more than one relationship, occurrences of 
existing entities can be affected by new referential-integrity constraints. 

Driver(s): Several different change drivers for entity insertions have been 
discerned from the case studies: 

- extension of the UoD to cover adjacent aspects 

- view integration, i.e. incorporating facts that belong to another view on the 
  same UoD (e.g. financial data in a product database) 

Impact (for DA): The new entity was often already referred to implicitly, e.g. in 
text. Any such implicit references must be converted into valid foreign-key 
references. 

Impact (for best-practices): Two aspects are important for best-practices: 

a. to determine the 'most natural' UoD boundary, thereby anticipating future 
   change in the schema and thus enhancing its stability 

b. to integrate other views in the UoD whenever this adds value for current and 
   future users of the CS 
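
The entity-insertion pattern can be illustrated on a small relational schema. The following is only a minimal sketch, assuming a hypothetical `product` table that so far referred to suppliers implicitly through a free-text column `supplier_name`; none of these names come from the case studies. It inserts the new entity together with its single relationship in one maintenance step and converts the implicit references into foreign-key values.

```python
import sqlite3

def insert_supplier_entity(conn):
    """Entity insertion: add a new entity and link an existing one to it."""
    conn.executescript("""
        -- New entity, populated from the implicit (textual) references.
        CREATE TABLE supplier (
            supplier_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL UNIQUE
        );
        INSERT INTO supplier (name)
            SELECT DISTINCT supplier_name FROM product
            WHERE supplier_name IS NOT NULL;

        -- New relationship: a nullable foreign key on the existing entity.
        ALTER TABLE product
            ADD COLUMN supplier_id INTEGER REFERENCES supplier(supplier_id);
        UPDATE product
           SET supplier_id = (SELECT s.supplier_id FROM supplier s
                              WHERE s.name = product.supplier_name);
    """)

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE product (
            product_id    INTEGER PRIMARY KEY,
            name          TEXT NOT NULL,
            supplier_name TEXT          -- implicit reference, free text
        );
        INSERT INTO product VALUES (1, 'bolt', 'Acme'), (2, 'nut', 'Acme'),
                                   (3, 'gear', 'Globex'), (4, 'cog', NULL);
    """)
    insert_supplier_entity(conn)
    return conn.execute("""
        SELECT p.name, s.name FROM product p
        LEFT JOIN supplier s USING (supplier_id) ORDER BY p.product_id
    """).fetchall()
```

Existing `product` rows are not otherwise changed, which matches the observation that insertions with a single relationship ordinarily need no change in existing data.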
Fig 2 Entity insertion 



Fig 3 Entity elimination 



3.2 Entity elimination 

Like entity insertion, elimination (figure 3) demonstrates flexibility of the CS and is 
listed in most taxonomies. Entity elimination is often seen as a side-issue of more 
important changes in the CS, e.g. to boost project profitability. As the entity has lost 
its importance, there will be no sponsor to pay for its elimination! 



Purpose: To save cost. 

Change (in the CS): An existing entity and its relationships are dropped from the 
CS. References to it are either deleted, or they are reduced to plain attribute 
types. In our case studies, entities having more than 1 relationship in the CS 
were never dropped. In this sense, entities with just 1 relationship are 'at the 
fringe' of the CS. 

Change (in the data population): Data is deleted. Foreign-key values in other 
entities must be adjusted. In our business cases, every entity that had only one 
instance was eliminated. 

Driver(s): The UoD concept has become so much less important that it is no longer 
worthwhile to represent it in the CS. But this does not mean that the concept 
ceases to exist in the UoD. 

Impact (for DA): Check that every entity in the CS is relevant to enough users in 
the organization now and in the near future, especially for entities that are 'at 
the fringe' of the CS. 

Impact (for best-practices): An entity can't be eliminated from the schema if its 
key was propagated into weak-entity keys for other entities. If an entity might 
ever be eliminated, then don't incorporate its key into weak-entity keys. 
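
As an illustration of entity elimination, the sketch below drops a hypothetical fringe entity `warehouse` that has a single relationship with `product`, and reduces the former reference to a plain attribute. All table and column names are assumptions made for the example. The table is rebuilt rather than altered in place, because SQLite does not allow dropping a column that takes part in a foreign-key constraint.

```python
import sqlite3

def eliminate_warehouse(conn):
    """Entity elimination: drop `warehouse`, keep its name as a plain attribute."""
    conn.executescript("""
        BEGIN;
        CREATE TABLE product_new (
            product_id     INTEGER PRIMARY KEY,
            name           TEXT NOT NULL,
            warehouse_name TEXT              -- former reference, now plain data
        );
        INSERT INTO product_new (product_id, name, warehouse_name)
            SELECT p.product_id, p.name, w.name
            FROM product p LEFT JOIN warehouse w
              ON w.warehouse_id = p.warehouse_id;
        DROP TABLE product;
        ALTER TABLE product_new RENAME TO product;
        DROP TABLE warehouse;
        COMMIT;
    """)

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE warehouse (
            warehouse_id INTEGER PRIMARY KEY,
            name         TEXT NOT NULL
        );
        CREATE TABLE product (
            product_id   INTEGER PRIMARY KEY,
            name         TEXT NOT NULL,
            warehouse_id INTEGER REFERENCES warehouse(warehouse_id)
        );
        INSERT INTO warehouse VALUES (10, 'North');
        INSERT INTO product VALUES (1, 'bolt', 10), (2, 'nut', NULL);
    """)
    eliminate_warehouse(conn)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    rows = conn.execute("SELECT * FROM product ORDER BY product_id").fetchall()
    return tables, rows
```

Had the `warehouse` key been propagated into the primary key of `product`, the rebuild would also have had to change the identity of every `product` row, which is the situation the best-practice advice warns against.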
3.3 Entity replacement 

This change pattern (figure 4) reflects a shift in the UoD boundary, bringing detailed 
data into the CS that was previously only available in aggregate form. The change is 
not reported in most taxonomies, perhaps because the property of being 'source' or 
'derived' data is supposed to be a fixed and unchangeable property of data. 

Purpose: To replace derived data by source data. 

Change (in the CS): An existing entity is replaced by a new entity (or possibly 
several) while reducing the old entity to a userview. The relationships of the old 
entity are shifted accordingly. 

Change (in the data population): New data values and instances are recorded. The 
data already existed as it was used to derive the information content of the 
former entity. It was not yet within the scope of the old UoD. After the change, 
it is. 

Driver(s): The change is driven by both the need and the possibility to have more 
detailed information. The information need probably existed some time before the 
change was actually made, but it was only partially met by using derived data, 
because of operational, technological, or other limitations. When the limitations 
are overcome by some change in the enterprise, it is possible to record and manage 
the more detailed information and to bring the derivation process within the scope 
of the UoD. The data that is output of the derivation process is replaced by its 
source data, making the output data redundant. 

Impact (for DA): Care must be taken to ensure that the new data is fully 
compatible with data that was previously recorded for the now-redundant entity. 
Also, as relationships shift from the old entity onto the new entities, the 
corresponding foreign-key values must be converted. In our case studies the old 
entity wasn't deleted from the internal schema, to prevent the need of changing 
old software and minimize the impact of change. Thus the change increases 
redundancy in the stored data. 

Impact (for best-practices): Three aspects are important for best-practices: 

a. determine the 'most natural' UoD boundary, thereby anticipating future change 
   in the schema and thus enhancing its stability 

b. discuss the (technological) means of data acquisition, and be aware of 
   technological advance 

c. consider granularity of all data, i.e. consider refining an entity in the CS if 
   it contains aggregate data that is relevant elsewhere in the company 
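
To make the replacement pattern concrete, here is a hedged sketch in which a hypothetical aggregate table `monthly_sales` is replaced by the source entity `sale`, while the old name survives as a view derived from the source data. Keeping the old table as `monthly_sales_old` mirrors the case-study observation that the old entity stayed in the internal schema; the compatibility check is an illustrative assumption, not a procedure taken from the paper.

```python
import sqlite3

def replace_aggregate_by_source(conn):
    """Entity replacement: introduce source data, reduce the old entity to a view."""
    conn.executescript("""
        CREATE TABLE sale (
            sale_id    INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL,
            sold_on    TEXT    NOT NULL,   -- ISO date, e.g. '1999-03-14'
            amount     INTEGER NOT NULL    -- in cents
        );
        -- Old entity is kept in the internal schema under another name.
        ALTER TABLE monthly_sales RENAME TO monthly_sales_old;
        -- The former entity becomes a userview over the source data.
        CREATE VIEW monthly_sales AS
            SELECT product_id,
                   substr(sold_on, 1, 7) AS month,
                   SUM(amount)           AS total
            FROM sale
            GROUP BY product_id, substr(sold_on, 1, 7);
    """)

def incompatible_months(conn):
    """Return (product_id, month) pairs where old and derived totals disagree."""
    return conn.execute("""
        SELECT o.product_id, o.month
        FROM monthly_sales_old o
        JOIN monthly_sales v
          ON v.product_id = o.product_id AND v.month = o.month
        WHERE v.total <> o.total
        ORDER BY o.product_id, o.month
    """).fetchall()

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE monthly_sales (product_id INTEGER, month TEXT, total INTEGER);
        INSERT INTO monthly_sales VALUES (1, '1999-03', 300), (1, '1999-04', 500);
    """)
    replace_aggregate_by_source(conn)
    conn.executemany(
        "INSERT INTO sale (product_id, sold_on, amount) VALUES (?, ?, ?)",
        [(1, "1999-03-02", 100), (1, "1999-03-20", 200), (1, "1999-04-05", 450)])
    derived = conn.execute(
        "SELECT product_id, month, total FROM monthly_sales ORDER BY month"
    ).fetchall()
    return derived, incompatible_months(conn)
```

In the demo data the April total derived from the source rows differs from the previously recorded aggregate, which is the kind of incompatibility a data administrator would need to resolve before retiring the old entity.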
3.4 Make an entity state-aware 

This change (figure 5) adds the time dimension, and redefines the data lifecycles of a 
snapshot entity by separating it into two new entities. The pattern prevents state 
changes from propagating down to occurrences of dependent entities, thus limiting the 
impact of change. In our essential data model theory, this is the only way to make an 
entity state-aware. Temporal data model theories [Jensen, Snodgrass’96] offer other 
solutions. 



Purpose: To capture more semantics on the same UoD. 

Change (in the CS): The upper entity represents the 'lifetime existence' of the 
old entity. Time-varying attributes of the former entity are 'pushed down' into 
the lower entity that is state-aware and contains a valid-time stamp (or 
sequence-number). The occurrences of this state-aware entity must exactly cover 
the 'lifetime existence' represented by the upper entity. 

Change (in the data population): All current data must be divided into 
once-in-a-lifetime data and state-aware data. 

Driver(s): Users will expect historical data to be available even when they have 
agreed to a schema that only reflects a current state of affairs. From their point 
of view, historic data is a minor extension on existing information needs. 

Impact (for DA): This change should be approached with caution as it has a large 
impact, and user access to the data can become more complicated. 

Impact (for best-practices): For every entity (and relationship) in a schema 
design, the relevance of it being state-aware should be considered. 
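
The split into a lifetime entity and a state-aware entity can be sketched as follows. The `employee` table, its columns, and the choice of the conversion date as the first valid-time stamp are all assumptions for illustration; only the snapshot at conversion time is known, so earlier states cannot be reconstructed by this sketch.

```python
import sqlite3

def make_state_aware(conn, conversion_date):
    """Split snapshot `employee` into a lifetime entity and `employee_state`."""
    with conn:
        conn.execute("ALTER TABLE employee RENAME TO employee_snapshot")
        # Upper entity: once-in-a-lifetime data only.
        conn.execute("""
            CREATE TABLE employee (
                emp_id     INTEGER PRIMARY KEY,
                name       TEXT NOT NULL,
                birth_date TEXT NOT NULL
            )""")
        # Lower entity: time-varying attributes plus a valid-time stamp.
        conn.execute("""
            CREATE TABLE employee_state (
                emp_id     INTEGER NOT NULL REFERENCES employee(emp_id),
                valid_from TEXT    NOT NULL,
                salary     INTEGER NOT NULL,
                PRIMARY KEY (emp_id, valid_from)
            )""")
        conn.execute("""
            INSERT INTO employee
            SELECT emp_id, name, birth_date FROM employee_snapshot""")
        conn.execute("""
            INSERT INTO employee_state
            SELECT emp_id, ?, salary FROM employee_snapshot""",
            (conversion_date,))
        conn.execute("DROP TABLE employee_snapshot")

def salary_as_of(conn, emp_id, date):
    """Return the salary valid on `date`, or None if no state covers it."""
    row = conn.execute("""
        SELECT salary FROM employee_state
        WHERE emp_id = ? AND valid_from <= ?
        ORDER BY valid_from DESC LIMIT 1""", (emp_id, date)).fetchone()
    return row[0] if row else None

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE employee (
            emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL,
            birth_date TEXT NOT NULL, salary INTEGER NOT NULL
        );
        INSERT INTO employee VALUES (7, 'Ann', '1970-01-01', 3000);
    """)
    make_state_aware(conn, "1999-01-01")
    # A later state change adds a row; the lifetime entity is untouched.
    conn.execute("INSERT INTO employee_state VALUES (7, '1999-07-01', 3300)")
    return [salary_as_of(conn, 7, d)
            for d in ("1998-06-01", "1999-03-01", "1999-08-01")]
```

Because dependent entities would reference `employee.emp_id`, a salary change only adds a row to `employee_state` and does not propagate to them. The price is visible in `salary_as_of`: reading the current value now requires a time-qualified query.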
Fig 6 Promote specialization 



Fig 7 Dissolve generalization 



3.5 Promote specialization into full entity 

This pattern (figure 6) also separates an entity in two, but for other reasons. 



Purpose: To model an important specialization in a better way. 

Change (in the CS): An entity that was previously modelled as a special case of a 
more general concept, is now recognized as an entity in its own right, possibly 
with its own primary key attribute. 

Change (in the data population): The instances of the common generalization are 
moved into either the promoted specialization, or they stay behind in the 
restricted generalization. 

Driver(s): When the objects in the UoD that are represented by this specialization 
become more important in the organization, various attribute types will be added 
to this specialization. Also, users will demand easier access to that data (they 
don't want to select relevant occurrences every time). In the end, this 
specialization differs so much from the other participants in the generalization 
that the change is inevitable. Notice that the generalization does not become 
invalid, it just becomes less important in the CS. 

Impact (for DA): The primary key of the former entity will often be kept in the 
new entity for compatibility reasons. Thus the former key-integrity constraint 
will change into a uniqueness constraint across two separate entities. 

Impact (for best-practices): When creating a generalization, the relative 
importance of its various specializations must be considered. If one is much more 
important than the others, don't generalize. 
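
A minimal sketch of promoting a specialization, assuming a hypothetical generalization `party` with a discriminating attribute `kind`, of which the `customer` specialization is promoted. The former primary key is kept in the new entity. Since a uniqueness constraint spanning two tables cannot be declared directly in this setting, the sketch checks it with a query; that check is an illustrative assumption.

```python
import sqlite3

def promote_customer(conn):
    """Move the 'customer' specialization of `party` into its own entity."""
    with conn:
        conn.execute("""
            CREATE TABLE customer (
                party_id     INTEGER PRIMARY KEY,  -- former key, kept
                name         TEXT NOT NULL,
                credit_limit INTEGER               -- new, customer-only attribute
            )""")
        conn.execute("""
            INSERT INTO customer (party_id, name)
            SELECT party_id, name FROM party WHERE kind = 'customer'""")
        # The remaining instances stay behind in the restricted generalization.
        conn.execute("DELETE FROM party WHERE kind = 'customer'")

def key_collisions(conn):
    """Keys present in both entities, violating the cross-entity uniqueness."""
    return [r[0] for r in conn.execute("""
        SELECT party_id FROM party
        INTERSECT
        SELECT party_id FROM customer
        ORDER BY party_id""")]

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE party (
            party_id INTEGER PRIMARY KEY,
            kind     TEXT NOT NULL,
            name     TEXT NOT NULL
        );
        INSERT INTO party VALUES (1, 'customer', 'Ann'),
                                 (2, 'supplier', 'Bob'),
                                 (3, 'customer', 'Cy');
    """)
    promote_customer(conn)
    before = key_collisions(conn)
    # Nothing in the schema prevents re-using key 1 in the generalization.
    conn.execute("INSERT INTO party VALUES (1, 'supplier', 'Dup')")
    return before, key_collisions(conn)
```

The second result shows why the data administrator has to take over a constraint that the primary key used to enforce automatically.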
3.6 Dissolve generalization 

This semantic change pattern (figure 7) at the CS level is somewhat similar to full 
horizontal fragmentation of a relational table. In the CS, this pattern is an extension of 
the previous pattern while on the UoD level, the change drivers are different. 

Purpose: To simplify the CS. 

Change (in the CS): A generalized entity in the CS is dissolved, and all of its 
specializations are promoted into independent entities. The attributes of the 
former generalization are relocated to one, or perhaps several of the new entities. 
But the old primary key has no more use and is probably deleted. All relationships 
have to be redefined; in our case studies, each was relinked to exactly one of the 
new entities. 

Change (in the data population): Every instance of the generalization has to be 
moved into exactly one of the now-independent specializations. 

Driver(s): Users expect the simple and stable UoD structure to be represented in an 
equally simple CS structure. Generalization in the CS allows flexibility w.r.t. 
change in UoD structures, but it introduces more complex data access and data 
maintenance. When the UoD structure is fixed, flexibility is not needed. Dissolving 
the generalization will simplify data updates and queries. 

Impact (for DA): If the generalized entity had a type-attribute to discriminate its 
various specializations, then all values of that attribute will be lifted from the 
level of database content and be 'hardcoded' on the CS level. 

Impact (for best-practices): Several aspects are important: 

a. consider stability of the schema. A generalization is not needed if occurrences 
   can't change their type, and no new specializations are to be expected 

b. weigh simplicity against stability and understandability of the schema. 
   Generalization will improve stability and understandability, but will lower 
   simplicity and ease of access to the data 

c. check the number of specializations. A large number of specializations makes the 
   use of a generalized entity more attractive 
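
The similarity with horizontal fragmentation can be shown in a short sketch. It assumes a hypothetical generalization `vehicle` with a type-attribute `type` and two specializations; the names and the choice of `plate` as the surviving identifier are illustrative only. The values of the type-attribute become table names, i.e. they move from database content to the schema level.

```python
import sqlite3

# Hypothetical specializations and the attributes relocated to each of them.
SPECIALIZATIONS = {
    "truck": ["plate", "payload_kg"],
    "bus":   ["plate", "seats"],
}

def dissolve_vehicle(conn):
    """Move every `vehicle` row into exactly one independent entity."""
    placeholders = ", ".join("?" for _ in SPECIALIZATIONS)
    orphans = conn.execute(
        f"SELECT COUNT(*) FROM vehicle "
        f"WHERE type IS NULL OR type NOT IN ({placeholders})",
        tuple(SPECIALIZATIONS)).fetchone()[0]
    if orphans:
        raise ValueError(f"{orphans} instance(s) fit no specialization")
    with conn:
        # Table names come from the constant above, never from user input.
        for kind, cols in SPECIALIZATIONS.items():
            col_list = ", ".join(cols)
            conn.execute(f"CREATE TABLE {kind} ({col_list})")
            conn.execute(
                f"INSERT INTO {kind} SELECT {col_list} FROM vehicle "
                f"WHERE type = ?", (kind,))
        # The generalization, its old key and its type-attribute disappear.
        conn.execute("DROP TABLE vehicle")

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE vehicle (
            vehicle_id INTEGER PRIMARY KEY,
            type       TEXT,
            plate      TEXT NOT NULL,
            payload_kg INTEGER,
            seats      INTEGER
        );
        INSERT INTO vehicle VALUES (1, 'truck', 'AB-1',  8000, NULL),
                                   (2, 'bus',   'CD-2',  NULL, 50),
                                   (3, 'truck', 'EF-3', 12000, NULL);
    """)
    dissolve_vehicle(conn)
    return {kind: conn.execute(
                f"SELECT * FROM {kind} ORDER BY plate").fetchall()
            for kind in SPECIALIZATIONS}
```

After the change, adding a new kind of vehicle requires a schema change rather than a new value of `type`, which is the loss of flexibility the pattern trades for simpler access.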
3.7 Dissociate entity according to role 

This change pattern (figure 8) separates an entity into 2 (or more) new entities for yet 
other reasons. The pattern is presented in [Batini, Ceri, Navathe'92] (page 59). 



Purpose: To distinguish between roles in a way that was formerly not relevant in 
the UoD. 

Change (in the CS): One entity and all its relationships seem to be duplicated in 
the schema, but the semantics of the new entities and relationships differ. 
Consider for example a sales-and-service business that decides to reorganize each 
branch office into either a sales outlet, a repair shop, or both. 

Change (in the data population): Instances and attribute-values of the former 
entity are moved into one, or possibly both new entities, depending on the role(s) 
that the former instance played. 

Driver(s): Before the change, the two roles for the entity (and its relations) 
were naturally associated with each other. Change in the environment (legislation, 
new products, reorganization etc.) causes that association to be viewed 
differently, driving the need for another CS structure. 

Impact (for DA): All dependent entities need to be relinked to one, or possibly 
several occurrences of the now dissociated entities depending on new business 
rules. 

Impact (for best-practices): Prior to change, the entity and its relationships in 
the CS are known to integrate several roles. But the entity is not viewed as a 
generalization because distinguishing these roles isn't important. The change makes 
the distinction of roles in the UoD relevant to the CS. 
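
The branch-office example can be sketched as below. The `branch` table, the two target tables, and the `roles` argument (standing in for whatever business rule decides which role or roles an office plays) are hypothetical names introduced for illustration.

```python
import sqlite3

TARGETS = {"sales": "sales_outlet", "repair": "repair_shop"}

def dissociate_branch(conn, roles):
    """Split `branch` by role; `roles` maps branch_id -> set of role names."""
    branches = conn.execute(
        "SELECT branch_id, city FROM branch ORDER BY branch_id").fetchall()
    for branch_id, _ in branches:
        if not roles.get(branch_id):
            raise ValueError(f"no role decided for branch {branch_id}")
    with conn:
        for table in TARGETS.values():
            conn.execute(f"""
                CREATE TABLE {table} (
                    branch_id INTEGER PRIMARY KEY,
                    city      TEXT NOT NULL
                )""")
        for branch_id, city in branches:
            # An instance moves into one, or possibly both, new entities.
            for role in sorted(roles[branch_id]):
                conn.execute(
                    f"INSERT INTO {TARGETS[role]} VALUES (?, ?)",
                    (branch_id, city))
        conn.execute("DROP TABLE branch")

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE branch (branch_id INTEGER PRIMARY KEY, city TEXT NOT NULL);
        INSERT INTO branch VALUES (1, 'Paris'), (2, 'Lyon'), (3, 'Nice');
    """)
    dissociate_branch(conn, {1: {"sales"}, 2: {"repair"},
                             3: {"sales", "repair"}})
    return {table: conn.execute(
                f"SELECT * FROM {table} ORDER BY branch_id").fetchall()
            for table in TARGETS.values()}
```

Branch 3 ends up in both entities, so any dependent row that pointed at it must be relinked to one or both occurrences according to the new business rules; the sketch deliberately leaves that decision out.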
4. Discussion and conclusions 

This paper discussed a number of semantic change patterns that were abstracted from 
changes observed in operational CSs. Of course, our list is far from exhaustive. Still, 
it provides some insights that can contribute to the field of CS maintenance and 
design. However well designed, a CS depends for a large part on a 'current best view'. 
This is demonstrated by several change drivers that correspond to the ‘engineering 
abstractions’ arrows in figure 1: 

- entity replacement in the CS; reflecting the need, or perhaps the possibility to 
extend the UoD and record more detailed information whenever that information 
comes within reach for the users in the business environment 

- promotion of a specialization into a full entity; reflecting a structural change in the 
relative importance of objects in the UoD 

- dissolving a generalization that is considered too cumbersome; reflecting the need 
for a simple and understandable model of the UoD 

There are also implications for data model theories and their related taxonomies. 
Several changes commonly listed in taxonomies were not seen in the change patterns. 
It may be that too few case studies were examined; it may also indicate that current 
taxonomies do not account well for actual change in CSs. For instance, relationships, 
once they are defined in the schema, seem to be fairly stable, as relationship addition, 
elimination, or other alterations weren't encountered in the case studies. And entity 
integration is known in bottom-up design practices, but from the 'entity promotion' 
pattern (figure 6) it can be learned that entity integration comes at a cost that users are 
sometimes unwilling to pay. Finally, a taxonomy is symmetric by nature, as every 
change sequence in constructs can be read in both directions. But the semantic change 
patterns seen so far appear to be asymmetric (with the exception of entity 
insertion/elimination). In our case studies the reduction of a state-aware entity into a 
snapshot entity was never observed, nor did we see merging of entities into one 
generalization. 

Our change patterns abstract the know-how and experience that is present in maintenance 
and make it available to data administrators and researchers. In doing so, we have 
demonstrated the feasibility of this kind of research, and the relevance of its results. 
Data administrators will benefit from semantic change patterns because these will help 
them to quickly determine the scope and impact of necessary changes, and to select 
the best way to change the CS. Because the change is already well understood, 
conversions at the data-instance level and updating the CS documentation are 
facilitated. In our opinion, best-practices in maintenance should be based on a 
thorough knowledge of change patterns. Data designers can use the semantic change 
patterns pro-actively. By experimentally applying them to a schema design, one can 
get a feel for changes that might be appropriate now, or in the near future. 

Continued research into schema evolution of operational CSs should provide a more 
complete coverage of semantic change patterns. A paper is planned that will describe 
features such as compatibility, size and impact of actual changes in more detail. This 
will improve the applicability of semantic change patterns as a tool for schema 
maintenance. Also, taxonomies of changes can be extended to incorporate those 
semantic change patterns that are not yet accounted for. This will in turn lead to better 
flexibility of CSs to be designed. Another line of research is to investigate the 
business logic of change drivers, perhaps to the point where CS evolution can be 
predicted or even directed into strategically desirable directions. 
Abstract 

The role of reverse engineering system metadata when migrating legacy 
systems to enterprise software such as PeopleSoft™ has not been widely ar- 
ticulated. Bridging the gap between new enterprisewide software systems 
and legacy systems has proven to be an enormous and costly hurdle when 
attempted without sufficient understanding of the legacy environment and 
enterprise metadata. We present a model-based methodology for construct- 
ing a metadata-based foundation for migrating data from legacy systems to 
enterprise software. The methodology components include: 

• analysis of project organization, constraints and definition, 

• development of project control structure, and 

• deployment of toolkit components to capture, analyze and utilize 
metadata. 

Methodologies work best when supported by appropriate tools. Based on 
widely available Office-suite components, a flexible toolkit was developed 
to support the methodology and facilitate system evolution. The toolkit al- 
lows users to capture, analyze, and publish various implementation-specific 
metadata. The toolkit publishes metadata for use by the technical imple- 
mentation team as well as by project management and business users. We 
describe each methodology component, the associated toolkit elements de- 
veloped to implement each component, the various component outputs, and 
the resources required to implement the solution. The methodology was 
developed in the context of an anonymous real world implementation. Or- 
ganizations can use this approach to create a sound basis for reverse engi- 
neering system metadata when migrating to enterprise software. 

Introduction 

The migration to enterprise software packages is a gargantuan, convoluted, and very 
expensive task for any organization. According to one widely quoted report, in a 
typical year three-quarters of all IT projects will either overrun or fail, resulting in 
almost $100 billion in unexpected costs [1]. This research focused on the development 
of a toolkit methodology for use during a large-scale legacy migration effort. 
An interactive, simple, user-friendly toolkit was designed to provide a point-and-click 
environment for use by technical developers and business users during migration 
efforts, as well as templates for use during evolutionary phases. The toolkit is 
intended to give quick access to system metadata and to aid management by providing 
key facts, vital to the migration effort, that describe the organization's legacy and 
target environments. Its design provides a basis for communication among conversion 
project personnel. The toolkit contains multiple complementary components designed to 
capture, associate, connect, and manage legacy and target system metadata during reverse 
engineering [2 & 3] of system metadata, making it possible to track sources and uses 
of data [4]. The toolkit also supports problem identification, isolation and resolution, 
and provides conversion issues tracking and other project details. In addition to 
assisting with target system application management, it provides on-line support for 
identifying, researching, and providing formal associations between metadata describing: 

• custom and native target system fields, files and associated properties; 

• legacy system fields, files and properties; 

• associations among and between target and legacy system fields, files and proper- 
ties; 

• design specifications associated with customized elements in the target system; 

• migration mappings between legacy and target system data items; and 

• business process constraints. 

The conceptual toolkit mapping capabilities are illustrated in 
Fig. 1. 



[Figure 1 label: Metadata (Legacy and Target Systems Fields, Files, Properties 
and Associations)] 

Fig. 1. Notional toolkit mapping capabilities. 



Step 1: Analysis of Project Organization, Constraints and Definition 

The integration of catalogued legacy/target system technical and project management 
metadata into a comprehensive repository supports effective and efficient control of 
system evolution, offering savings opportunities in terms of efficiencies realized, 
timely problem resolution and system documentation and maintenance. The straight- 
forward design encourages and facilitates user investment. In addition, the toolkit 
design was conceived as evolving into a catalogued cross-reference of information 
resources and becoming an essential foundation for its corporate data repository. 



Project Organization 



Implementation Philosophy. The history of the project and the size of the organization 
made it important to consider prevailing attitudes of management concerning 
system architecture [5]. Corporate structure was being analyzed and realigned to 
include responsibility for enterprise development with the planned hiring of a system 
architect. Functional responsibilities of this position focus primarily on the dynamic 
system organization which interconnects the money, manpower, materiel, and MIS 
that characterize every organization. Evidencing a history of strong support for its 
operations via vigorous information systems services, the firm maintains a large, 
centralized MIS department. The sponsoring organization, a national retail product 
reseller, committed to transition from an AS400/Software 2000 environment to a 
UNIX-based PeopleSoft™ environment as a replacement for 27 homegrown HR and 
pay-related systems. Middle management staff were added or shifted from other projects; 
technical resources were transferred or hired; functional users were assigned 
responsibility for defining requirements, identifying legacy data elements and processes, 
and verifying results. Project planning and budgeting functions were conducted, resulting 
in estimates of development time and cost. These were notably understated. The 
tedious task of evaluating existing business processes as they related to the business 
rules logic of the target system began. Because many were not inherent in the target 
system, the not unusual decision was made to customize the target system. 
Customization included 708 custom tables and 2,359 custom fields. 



Project Constraints 



Pertinent employee legacy data resided on an AS400 server running Software 2000 
with multiple libraries, tables, fields, and processes to accommodate a workforce of 
over 100,000. Other relevant data was maintained on other AS400 servers, UNIX 
servers running Sybase, Informix, Allbase, Image, Oracle, and proprietary database 
applications to manage specific business processes. For example, one legacy custom 
application was in place to process data from hundreds of outlying locations and 
transmit it to centralized storage for processing and reporting. Predictably, quality of 
both the legacy data [6] and the legacy metadata [7] was too often inadequate, e.g., 
field element content descriptions were inaccurate; data was missing or incorrect. 
Conversion project documentation was often inadequate and lacking in version con- 
trol. Moreover, legacy-side technical expertise was already thinly spread and project 
HR experienced a high turnover during the migration process. These factors moti- 
vated an increase in the use of outside consultants. 

The pervasive use of consultants and the needed rework of fit-gap analyses (to 
accommodate subsequent release versions, from 5.0 to 7.5, incrementally), data migration 
mappings, and system requirements had multiple impacts on the conversion: project 
cost overruns, unmet development deadlines, and the complexity of optimally 
coordinating skill levels and experience. The conversion team relied on ad hoc database 
applications, spreadsheets and text documents to document the system and migration 
process. Overall project progress was monitored with off-the-shelf project management 
software. 

During Year Two of the project, it became clear that the largest obstacle was an 
inability to reliably track legacy metadata and effectively control scope creep. An 
available toolkit was identified, but the base price tag, $85,000.00, represented a major 
cost overrun to an already out-of-budget project, especially in terms of changes required 
to reflect site customizations. Nor was it compatible with the CASE tool chosen by 
the firm for long-term data modeling and object management. In light of this, the 
organization opted to design its own methodology and toolkit to support the 
conversion effort. 



Project Definition 



Development of Project Control Structure. The toolkit development team used a 
four-component process to better understand and address conversion team needs. 
Consequently, opportunities were identified for improved efficiency and effectiveness 
by creating a conversion model incorporating all high-level management and produc- 
tion requirements, illustrated in Fig. 2. 

CONVERSION JIGSAW PUZZLE 




Fig. 2. Migration Project Control Issues 



Component 1: Formal/informal interviews with business and technical staff and 
consulting experts to identify and model order of procedure for migration. 

Component 2: Identification of in-house tools for tracking and documenting 
evolution, with emphasis on retaining, realigning or redesigning for future use. 

Component 3: Analysis, design (or redesign) and integration of various tools (such 
as databases, spreadsheets, templates, etc.) to be used throughout evolution. 

Component 4: Development of trouble-shooting and problem resolution tools to be 
used once installation of target system component was accomplished and end users 
began interacting with the new environment. 
This methodology produced the framework of the toolkit, providing tracking 
capabilities and making possible point-and-click navigation through data mapping 
information, data source information and data use information, providing answers to 
many conversion-specific questions. See The BRIDGE - Metadata Directory, below. 

Step 2: Formulating the Control Structure. Determining the precise steps through 
which conversion advances led to development of the on-line structure required to 
plan, organize, direct and control the project and, further, to capture the current and 
appropriate state of the system for use by a variety of users: production (analysis, 
design, construction, migration and implementation), support (life cycle maintenance), 
functional (business process) and technical (infrastructure). By linking these aspects, 
project management enforces standards and version control while ensuring the most 
accurate environment in which to make project decisions. It is important, here, to 
critique the value of operating system functionality. In a Windows-based environment, 
it is essential that the “linking” functionality be employed to enforce version control 
by strict CRUD access to project documentation. The BRIDGE control structure is 
based on the order of events in the accomplishment of a system evolution: (1) concept, 
(2) analysis, (3) design, (4) construction, (5) migration, (6) implementation, and 
(7) support. It further reflects system constraints inherent to any system evolution: (1) 
management, (2) infrastructure, and (3) special issues (such as Y2K compliance). In 
addition to the very precise directory structure, The BRIDGE provides templates and 
links to documents intended to capture essential project information and produce 
project deliverables: high-level design, fit-gap analysis, general design, conceptual 
design, functional design specification, and technical design specification documents. 
Other documentation is also available. 

Step 3: Deployment of Toolkit Components to Capture, Analyze and Utilize 
Metadata 

A two-member team of students from Virginia Commonwealth University, Informa- 
tion Systems Research Institute (ISRI) was tasked with producing a target system data 
model and toolkit to report data source tracking, data use tracking and metadata for 
data mappings. What follows is a report of the development of The BRIDGE, includ- 
ing summaries of legacy and target system metadata capture, development of the Peo- 
pleSoft™ data model, and design of data source tracking and data use tracking func- 
tions. 



Task: Reverse Engineering for System Evolution 



Employing generally accepted reverse engineering techniques, the development team 
set out to accomplish the following steps. 

1. Identify legacy libraries and files of data elements to be migrated. 

2. Identify use and "owner" of each. 

3. Identify need for any modification to each. 

4. Identify legacy processes utilizing each legacy file/field. 

5. Identify dependent legacy and target processes and outputs utilizing each. 

6. Identify functional owner of each legacy process. 

7. Identify functional user of each legacy output. 

8. Extract and load legacy metadata to The BRIDGE. 

9. Extract and load target system metadata to The BRIDGE. 

10. Create data mappings and associations. 

11. Publish reports as required. 
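
The last steps of this list (loading metadata, creating mappings, publishing reports) can be pictured with a small sketch. It is not the toolkit's actual design, which was built from Office-suite components; the `Mapping` type, the field names, and the two report functions are hypothetical and serve only to show how one set of mappings yields both data source tracking and data use tracking.

```python
from collections import defaultdict
from typing import NamedTuple

class Mapping(NamedTuple):
    legacy: str   # "FILE.FIELD" in the legacy system (hypothetical names)
    target: str   # "RECORD.FIELD" in the target system (hypothetical names)

LEGACY_FIELDS = ["EMPMAST.EMPNO", "EMPMAST.LNAME",
                 "EMPMAST.FNAME", "PAYHIST.RATE"]

MAPPINGS = [
    Mapping("EMPMAST.EMPNO", "EMPLOYEE.ID"),
    Mapping("EMPMAST.EMPNO", "JOB.EMPLOYEE_ID"),
    Mapping("EMPMAST.LNAME", "EMPLOYEE.NAME"),
    Mapping("EMPMAST.FNAME", "EMPLOYEE.NAME"),
]

def track(mappings):
    """Return (sources, uses): target -> legacy fields, legacy -> target fields."""
    sources = defaultdict(list)   # data source tracking
    uses = defaultdict(list)      # data use tracking
    for m in mappings:
        sources[m.target].append(m.legacy)
        uses[m.legacy].append(m.target)
    return dict(sources), dict(uses)

def unmapped(legacy_fields, mappings):
    """Legacy fields that no mapping covers yet, for follow-up with the owner."""
    mapped = {m.legacy for m in mappings}
    return sorted(set(legacy_fields) - mapped)
```

For example, `track(MAPPINGS)` answers both "where does `EMPLOYEE.NAME` come from?" and "where is `EMPMAST.EMPNO` used?", while `unmapped(LEGACY_FIELDS, MAPPINGS)` lists fields still awaiting a decision.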

Reverse Engineering Issues. Legacy metadata was extracted from numerous locations, 
with varying degrees of automated support available. Early issues regarding 
application-level protocol interconnectivity were addressed and resolved. Given the 
pervasive use of Microsoft products, the decision was made to implement using off- 
the-shelf Office-suite components. Additionally, the integrated suite offered: (1) an 
ability to migrate to Informix; (2) the ability to publish toolkit elements in HTML or 
to a Lotus Notes database; (3) user-friendly design and testing features; (4) importing 
and exporting capabilities; and (5) no additional procurement cost. 



Mapping documents for many target system tables had been produced; however, these 
were merely MS Excel spreadsheet templates requiring manual input of target system 
information and associated legacy sources to document relationships. They were 
produced by a complex process of requesting a PeopleSoft record definition on-line, 
printing it, then re-keying it into Excel spreadsheets. Token effort was made at 
version control and naming conventions, but these quickly became inadequate. To speed 
up the process and provide valid decision-making information, metadata for both 
target and legacy systems was imported to The BRIDGE to permit cross-system 
element mapping. 

Target System Metadata Analysis 



Target system metadata was contained in tables within the DBMS. Tools were avail- 
able to query all file names and descriptions [8]. Query results were published in Ex- 
cel format and analyzed to categorize the types and functions of records. In addition 
to basic metadata tables, various cross-reference, process, report, and other tables 
became vital metadata sources for creating the target system metadata model. A list of 
the 253 PeopleSoft Version 7 site metadata tables is shown in Fig. 3. 

Relying in part on previous research (see [9 & 10]), the identified metadata files were 
categorized, ranked, and published to Excel for import to Access following analysis. 
Analysis provided a traceable route through the target system. The team created a 
data commonality analysis by listing all metadata files and their elements and cross- 
referencing duplicates or pseudonyms. Associating common elements among the 
files, the team was able to construct a sequence logic to track the internal target sys- 
tem functioning. Applying the sequence logic, the team, using target system meta- 
data, could precisely track the route of each field through the target system as it inter- 
acts with all activities and events occurring within the target application (Fig. 4). 
Fig. 3. Enterprise Software Metadata Tables Listing

Fig. 4. PeopleSoft v. 7.01 Metadata for Element Routing (Data Uses)



By isolating any given target system field, its properties and associations with events
(e.g., notifying Benefits personnel of new hires) and activities (e.g., computing
benefits package statistics) can be determined. Based on the logic established, a
target system data metamodel was created (Fig. 5).




Fig. 5. Target System Data MetaModel 
Legacy System Metadata Analysis



Legacy system metadata was far more complex. This phase focused initially on 
seemingly well-structured legacy data managed by Software 2000 (S2K), narrowing 
the analysis focus and providing a legacy data subset to model. Legacy system analy- 
sis also focused on an interconnected hardware and software system of proprietary 
and modified off-the-shelf applications used to collect, manage, and report corporate 
data. 

To correctly "address" any given element within S2K, the following locations must be 
known: (1) the server on which the DBMS is employed, (2) the database segments 
that exist, (3) the files within those segments, and (4) the elements of those files. Us- 
ing these facts and creating a unique database identification number based on the 
server and DBMS designations, each element can be tracked forward or backward to 
database segments, files and fields (Fig. 6). 
Fig. 6. Enterprise Metadata Model



S2K utilities produce hard copy system metadata reports, but not soft copies that can
be manipulated to extract the relevant metadata such as field name, field length, field
attribute and field description. Informix scripts were therefore written to extract the
S2K metadata and migrate it to Excel in preparation for import to The BRIDGE.



Task: Design and Develop Toolkit Components

The BRIDGE was designed to assist in the implementation of enterprisewide applications
based on metadata describing systems environments. Of obvious interest to
project management is the resultant reduction in costs associated with error and
inefficiency. By maintaining a centralized and secure repository, keying errors,
misspellings, guesswork, and, in general, "information deficiency" can be virtually eliminated.
Toolkit components designed to avoid these costly errors are illustrated in Fig. 7
below.

Design of the toolkit hinged on several basic components: (1) identification of the 
migration process steps; (2) identification and analysis of state-of-the-system migra- 
tion methodologies in place at each stage, e.g., analysis of legacy metadata, correla- 
tion to target system metadata, data capture/load, issues management, process run 
control, roll-out, clean-up and documentation; (3) integration of heterogeneous legacy 
environment metadata into a seamless whole to avoid redundancy, implement version 
control, security, etc.; and (4) need for user interfaces to streamline future target sys- 
tem management and support. As expected, many opportunities were identified for 
production of standards, procedures, conversion-specific tools and generally im- 
proved organization of the project deliverables. The expected result was better com- 
munication among technical and business migration team members, fewer instances 
of error, diminished demands on systems resources, fewer reworks, and ultimately a 
smoother overall conversion process. 

The BRIDGE: System Evolution Management Toolkit 

Based on analysis of elements basic to any system evolution and on the reverse engi- 
neering methodology employed, tools were designed to assist cross-functionally with 
the implementation of an enterprisewide software package. What follows is a brief 
description of each database tool. 
Fig. 7. The BRIDGE - Conversion Project Database Tools

The BRIDGE - Data Geography: Tables were created in The BRIDGE to accommodate
the "key" that "opens" each data element "address." Assigning a unique key to
(1) each database/server combination, (2) each segment of each database/server
combination, (3) each file from each segment, and (4) each data element of each file
allows users to locate elements and related metadata, identify the source of converted
data, and associate element use with legacy processes.
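
The four-level "address" described above can be sketched as a small data structure. The following Python fragment is only an illustration under stated assumptions: The BRIDGE itself was built with desktop database tools, and the class name ElementAddress, its field names, the catalog dictionary, and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementAddress:
    """Hypothetical four-part key locating one data element."""
    database_id: int   # key for a database/server combination
    segment_id: int    # key for a segment of that database
    file_id: int       # key for a file within that segment
    element_id: int    # key for a data element within that file

    def parent_file(self):
        """Return the (database, segment, file) part of the address."""
        return (self.database_id, self.segment_id, self.file_id)


# Illustrative catalog: address -> element metadata (all values invented).
catalog = {
    ElementAddress(1, 10, 100, 1): {"name": "EMP-ID", "length": 9},
    ElementAddress(1, 10, 100, 2): {"name": "EMP-NAME", "length": 30},
    ElementAddress(2, 20, 200, 1): {"name": "DEPT-CODE", "length": 4},
}


def elements_in_file(database_id, segment_id, file_id):
    """List the names of all elements stored in one file."""
    wanted = (database_id, segment_id, file_id)
    return sorted(meta["name"] for addr, meta in catalog.items()
                  if addr.parent_file() == wanted)
```

Because the key is immutable and hashable, it can index element metadata directly, and any prefix of the key identifies the enclosing file, segment, or database.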

The BRIDGE guides users through three areas most likely to require support: (a)
metadata issues, (b) data sources, and (c) data uses. The main menu narrows search 
categories to primary target system elements: fields, records, panels, menus, views 
and reports. Users can access information regarding the properties and sources and 
uses of those elements. 



The BRIDGE - Metadata Directory: From random sampling of testing incidents
recorded by the testing team, The BRIDGE was designed to lead users to answers to
questions such as the following.

• What custom/native fields and files exist within the legacy system and the
target system, and what are their properties (name, type, length, format, etc.)?

• What target system structures, processes and reports are associated with a 
particular field and/or file? 

• What technical design specifications are associated with custom elements of 
the target system? 

• What legacy elements, or combinations of elements, constitute the target system
elements?

• What system processes require access to target system data? 

• What legacy processes depend on target system data? 



Because target system metadata is accessible in ODBC format, directly linking
BRIDGE components to the target system permits metadata queries and provides
real-time database-state reporting. The savings realized by eliminating the need to
routinely refresh The BRIDGE offset performance limitations.



Since "data" is truly the combination of a fact and a meaning, metadata provides,
in a particular format, the "information and documentation that makes data sets
understandable and shareable for users" [11]. In addition to metadata for each element, the
user can quickly explore the relationships between any primary element and any other
primary elements.

The BRIDGE - Data Mapper: The screen shown in Fig. 9 illustrates one of the most 
important toolkit functions. Data Mapper provides point-and-click functioning to 
identify any given data element, review its properties, match it with any underlying 
(converted) data element, and point it toward its new configuration. By restricting 
data "addresses" to one-to-one relationships between the four-element key, it is possi- 
ble to establish any required many-to-one relationships between data elements by 
adding "NewFieldID#." This is especially useful when legacy data elements are 
combined to produce a single, target system data element. 
Fig. 8. Data Source Tracking



The information captured by Data Mapper leads to data source/use information. Pro- 
viding the user with options for qualifying each evolving field, the conversion team 
has access to centralized critical information and documentation related to each field. 
In addition, Data Mapper provides a tool for tracking issues regarding the conversion 
of legacy data to the new environment. It provides space for toolkit users to make 
notes, raise questions or log resolutions regarding specific fields. Also included with 
this screen is a field to accommodate a direct link to technical specification documen- 
tation for individual target system elements. 

Finally, Data Mapper is designed to open the data map for modification as necessary, 
date-stamping and modifying the status of the earlier version, then creating an
additional row in The BRIDGE table, Data Source Tracking. As this function is
employed, it captures a modification date so that system maintenance can purge
unneeded information as indicated. Of course, updating migration issues relative to
modification provides additional valuable information about relationships between 
legacy and target environments. 
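
The versioning behaviour described above (date-stamp the earlier row, change its status, add a new row, purge later) can be sketched with an in-memory SQLite table. This is a minimal illustration, not the toolkit's implementation: the table layout, the column names, the status values 'current' and 'superseded', and the sample field names are all assumptions.

```python
import sqlite3
from datetime import date


def open_tracking_db():
    """Create an in-memory stand-in for the Data Source Tracking table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE data_source_tracking (
               row_id        INTEGER PRIMARY KEY,
               legacy_field  TEXT NOT NULL,
               new_field_id  TEXT NOT NULL,
               status        TEXT NOT NULL,
               modified_on   TEXT
           )"""
    )
    return conn


def add_mapping(conn, legacy_field, new_field_id):
    """Insert a fresh mapping row marked as the current version."""
    conn.execute(
        "INSERT INTO data_source_tracking (legacy_field, new_field_id, status)"
        " VALUES (?, ?, 'current')",
        (legacy_field, new_field_id),
    )


def revise_mapping(conn, legacy_field, new_field_id, today=None):
    """Date-stamp and supersede the current row, then add the new version."""
    stamp = (today or date.today()).isoformat()
    conn.execute(
        "UPDATE data_source_tracking"
        " SET status = 'superseded', modified_on = ?"
        " WHERE legacy_field = ? AND status = 'current'",
        (stamp, legacy_field),
    )
    add_mapping(conn, legacy_field, new_field_id)


def purge_superseded(conn, before):
    """Delete superseded rows stamped earlier than the ISO date `before`."""
    conn.execute(
        "DELETE FROM data_source_tracking"
        " WHERE status = 'superseded' AND modified_on < ?",
        (before,),
    )
```

Keeping superseded rows until an explicit purge preserves the history of each mapping decision, which is the information the paragraph above describes as valuable for understanding legacy-to-target relationships.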

The BRIDGE - Migration Rules Directory: This function, a subscreen of Data Mapper,
allows the developer to capture pseudocode (or even code language) for legacy
data extraction, rebuilding, and loading if no ETL is employed. The Migration Rules
Directory can contain code that, when queried, is reported in text format and can be
reused in writing future extract procedures if needed.

The BRIDGE - Migrations Tracking: The migrations tracking database provides for
the capture of migration, and re-migration, activity information by specification
identifier, as well as sign-off tracking necessary to manage the migration process.

The BRIDGE - Business Process Tracking: Information about each business proc- 
ess — identifiers, process steps, dependencies, contacts, data inputs, outputs, and test- 
ing information such as test script identifiers, break points and recovery steps — is 
contained in a separate database (see Fig. 10). This tool permits the tracking of sub- 
system information, including business processes, interfaces, production applications, 
support applications, and implementation-specific details. 

The BRIDGE - Design Specification Tracking: Via its Tech Specs Link, The BRIDGE
provides a hyperlink connection to text documents containing design and functional 
requirements and other pertinent information about the specification. The information 
in this table originated as a spreadsheet to provide intermediate search capabilities 
during the migration project. By converting the spreadsheet to an Access table, users 
retain this capability while gaining the ability to access centralized information via 
link to the element composite. Tech Spec Link also provides summary information 
and linking for individual specifications via the main menu. 

The BRIDGE - Patch Activity Tracking: Formalized change management for required 
post-implementation modifications must be enforced. This is true not only of target 
system construction, but also of target system infrastructure management. Target sys- 
tem updates are electronically published weekly in text format by the vendor. A 
"Patches and Fixes" component was developed to load and track information about 
those updates. Capture of update information is primarily a manual operation involv- 
ing the cutting and pasting of information provided by the software vendor. Utilizing 
Word and Excel to massage and reformat the electronically provided information, the
"patches and fixes" data is imported to the appropriate database tool.







Fig. 10. Data Use Tracking 



The BRIDGE - User Log: User Log is a transparent function that tracks the number of
users and search strings, qualifying each by primary element type researched. By
adding date/time stamps, User Log provides the opportunity to track which elements
present the most challenges, how the toolkit is being used, and the frequency of use and
efficiency of the tool.

Results 

The toolkit prototype generated significant interest across the team. Although factors 
related to development of The BRIDGE surfaced as early as four months prior to ini- 
tial design and development of the toolkit, the cost associated with design and devel- 
opment of the toolkit consisted of direct labor costs exclusively. The cost of the two- 
person development team, working approximately 20 hours per week each for a pe- 
riod of four months, was less than $20,000. The four-month project included analysis 
and design, and extraction and load of test and live metadata. Given the almost 50% 
and 100% overruns for implementation project cost and time respectively (totaling 
millions of dollars), the "price tag" represents a significant cost savings, assuming 
early deployment of the toolkit could have eliminated much of the subsequent rework. 
Having no "parallel" test case against which to measure actual savings prohibits pro- 
jections of savings to be anticipated by other users of The BRIDGE methodology. 
Based solely on a comparison to the price of an application-specific support tool, The
BRIDGE also represents a cost savings in terms of initial outlay.

Conclusion



Understanding the contents and uses of legacy data when evolving from a legacy
system to a new environment, and reverse engineering systems to identify and formally
locate the facts and meanings of corporate data, have other benefits as well [12]. For
example, development of corporate data "minimarts" that ultimately contribute to
corporate data warehouses becomes simpler and more straightforward. By formally 
"addressing" the locations and tracking metadata of the system, more useful decision- 
making information becomes available to teams predicting performance and control- 
ling operations. When employed in conjunction with the implementation methodol- 
ogy discussed here, the migration effort will run more smoothly, be more efficient, 
and cost less. 

Abstract. Database reverse engineering is a complex activity that can be modeled
as a sequence of two major processes, namely data structure extraction and
data structure conceptualization. The first process consists in reconstructing the
logical - that is, DBMS-dependent - schema, while the second process derives
the conceptual specification of the data from this logical schema. This paper
concentrates on the first process, and more particularly on the reasoning and the
decision process through which the implicit and hidden data structures and
constraints are elicited from various sources.



1 Introduction 

Reverse engineering a piece of software consists, among other things, in recovering or
reconstructing its functional and technical specifications, starting mainly from the
source text of the programs. Recovering these specifications is generally intended to
redocument, convert, restructure, maintain or extend legacy applications.

In information systems, or data-oriented applications, i.e., in applications the central 
component of which is a database (or a set of permanent files), it is generally consid- 
ered that the complexity can be broken down by considering that the files or database 
can be reverse engineered (almost) independently of the procedural parts, through a 
process called Database Reverse Engineering (DBRE in short). 

This proposition to split the problem in this way can be supported by the following 
arguments. 

• The semantic distance between the so-called conceptual specifications and the 
physical implementation is most often narrower for data than for procedural parts. 

• The permanent data structures are generally the most stable part of applications. 

• Even in very old applications, the semantic structures that underlie the file struc- 
tures are mainly procedure-independent (though their physical structures may be 
highly procedure-dependent). 

• Reverse engineering the procedural part of an application is much easier when the 
semantic structure of the data has been elicited. 

Therefore, concentrating on reverse engineering the application data components first 
can be much more efficient than trying to cope with the whole application. 

Even if reverse engineering the data structure is "easier" than recovering the
specification of the application as a whole, it is still a complex and long task. Many
techniques, proposed in the literature and offered by CASE tools, appear to be limited in
scope and are generally based on unrealistic assumptions about the quality and
completeness of the source data structures to be reverse engineered. For instance, they
often assume that all the conceptual specifications have been declared in the DDL
(data description language), so that all the other information sources are ignored. In
addition, they assume that the schema has not been deeply restructured for performance
or for any other requirement, and that names have been chosen rationally.

These conditions cannot be assumed for most large operational databases. Since the 
early nineties, some authors have recognised that the analysis of the other sources of 
information is essential to retrieve data structures ([1], [4], [8], [9] and [10]). 

The constructs that have been declared in the DDL are called explicit constructs.
Conversely, the constraints and structures that have not been declared explicitly
are called implicit constructs. The analysis of DDL statements alone leaves the
implicit constructs undetected.

Recovering undeclared, and therefore implicit, structures is a complex problem, for
which no deterministic methods exist so far. A careful analysis of all the information
sources (procedural sections, documentation, database contents, etc.) can accumulate
evidence for those specifications.




Fig. 1. The major processes of the reference DBRE methodology (left) and the development of
the data structure extraction process (right).

This paper presents in detail the refinement process of our database reverse
engineering methodology. This process extracts all the implicit constructs through the
analysis of the information sources. The paper is organized as follows. Section 2 is a
synthesis of a generic DBMS-independent DBRE methodology. Section 3 describes
the data structure extraction process. Section 4 discusses how to refine a schema to enrich
it with all the implicit constraints. Section 5 presents a DBRE CASE tool which is
intended to support data structure extraction, including schema refinement.

2 A Generic Methodology for Database Reverse Engineering

The reference DBRE methodology [4] is divided into two major processes, namely
data structure extraction and data structure conceptualization (Fig. 1, left). These
processes address the recovery of two different schemas and require different concepts,
reasoning and tools. In addition, they roughly appear as the reverse of the physical and
logical design steps usually admitted in database design methodologies [2].



2.1 Data Structure Extraction 

The first process consists in recovering the complete DMS schema, including all the 
explicit and implicit structures and constraints, called the logical schema. 

It is interesting to note that this schema is the document the programmer must con- 
sult to fully understand all the properties of the data structures (s)he intends to work 
on. In some cases, merely recovering this schema is the main objective of the pro- 
grammer, who can be uninterested in the conceptual schema itself. 

This process will be discussed in detail in Sections 3 and 4.



2.2 Data Structure Conceptualization 

The second phase addresses the conceptual interpretation of the logical schema. It 
consists for instance in detecting and transforming or discarding non-conceptual struc- 
tures, redundancies, technical optimization and DMS-dependent constructs. 

The final product of this phase is the conceptual schema of the persistent data of the 
application. More detail can be found in [5]. 



2.3 An Example 

Fig. 2 gives a DBRE process example. The files and record declarations (2.a) are
analyzed to yield the physical schema (2.b). This schema is refined through the analy- 
sis of the procedural parts of the program (2.c) to produce the logical schema (2.d). 
This schema exhibits two new constructs, namely the refinement of the field CUS- 
DESC and the foreign key. It is then transformed into the conceptual schema (2.e). 
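
The refinement of CUS-DESC in this example can be illustrated with a few lines of Python. The sketch below only assumes what Fig. 2 shows: CUS-DESC is 80 characters long and is moved into a program variable DESCRIPTION made of NAME (30 characters) and ADDRESS (50 characters). The function name refine_field and the sample customer value are hypothetical.

```python
# Sub-field layout recovered from the program variable DESCRIPTION (Fig. 2.c):
# (name, length in characters), in declaration order.
DESCRIPTION_LAYOUT = [("NAME", 30), ("ADDRESS", 50)]


def refine_field(raw_value, layout):
    """Split a fixed-width field value into the sub-fields given by layout."""
    total = sum(length for _, length in layout)
    if len(raw_value) != total:
        raise ValueError(f"expected {total} characters, got {len(raw_value)}")
    parts, offset = {}, 0
    for name, length in layout:
        # Trailing blanks are padding in a fixed-width record.
        parts[name] = raw_value[offset:offset + length].rstrip()
        offset += length
    return parts


# An invented 80-character CUS-DESC value, padded as a fixed-width field.
cus_desc = "Smith".ljust(30) + "12 Main Street, Namur".ljust(50)
print(refine_field(cus_desc, DESCRIPTION_LAYOUT))
```

The length check mirrors the reasoning an analyst would apply: the move statement is evidence of a decomposition only because the two structures have the same total length.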



3 Data Structure Extraction 

The goal of this phase is to recover the complete DMS schema, including all the 
implicit and explicit structures and constraints. The main problem of the data structure 
extraction is to discover and to make explicit, through the refinement process, the 
structures and constraints that were either implicitly implemented or merely discarded 
during the development process. In the reference methodology we are discussing, the 
main processes of data structure extraction are the following (Fig. 1, right): 

a) the files and records declaration

    Select CUSTOMER assign to "cus.dat"
        organisation is indexed
        record key is CUS-CODE.
    Select ORDER assign to "ord.dat"
        organisation is indexed
        record key is ORD-CODE
        alternate record key is ORD-CUS
            with duplicates.

    FD CUSTOMER.
    01 CUS.
       02 CUS-CODE pic X(12).
       02 CUS-DESC pic X(80).
    FD ORDER.
    01 ORD.
       02 ORD-CODE PIC 9(10).
       02 ORD-CUS PIC X(12).
       02 ORD-DETAIL PIC X(200).

[DDL code analysis]

c) procedural fragments

    01 DESCRIPTION.
       02 NAME pic x(30).
       02 ADDRESS pic x(50).

    move CUS-DESC
        to DESCRIPTION.

    accept CUS-CODE.
    read CUSTOMER
        not invalid key
            move CUS-CODE
                to ORD-CUS
            write CUS.

[schema refinement]

d) the logical schema

    CUS                        ORD
      CUS-CODE                   ORD-CODE
      CUS-DESC                   ORD-CUS
        NAME                     ORD-DETAIL
        ADDRESS                  id: ORD-CODE
      id: CUS-CODE               ref: ORD-CUS
                                   acc

e) the conceptual schema

    CUSTOMER                   ORDER
      CODE                       CODE
      NAME          0-N  1-1     DETAIL
      ADDRESS      ----------    id: CODE
      id: CODE

Fig. 2. Database reverse engineering example.

• DDL code analysis: analysing the data structure declaration statements to extract
the explicit constructs and constraints, thus providing a raw physical schema. This
is one of the simplest processes in DBRE, and it can be automated easily (most
database-oriented CASE tools provide a collection of such analyzers).

• Physical integration: when more than one DDL source has been processed, several
extracted schemas can be available. All these schemas are integrated into one
global schema. The resulting schema (augmented physical schema) must include the
specifications of all these partial views.

• Schema refinement: the explicit physical schema obtained so far is enriched with
implicit constructs made explicit, thus providing the complete physical schema.
This process will be discussed in Section 4.

• Schema cleaning: once all the implicit constructs have been elicited, technical
constructs such as indexes or clusters are no longer needed and can be discarded in
order to get the complete logical schema (or simply the logical schema).

The final product of this phase is the complete logical schema, which includes both
explicit and implicit structures and constraints. This schema is no longer DMS-compliant
for at least two reasons. Firstly, it is the result of the integration of different
physical schemas, which can belong to different DMS. For example, some part of the
data can be stored in a relational DB, while others are stored in standard files.
Secondly, the complete logical schema is the result of the refinement process, which
enhances the schema with recovered implicit constraints.

The data structure extraction process is often easier for true databases than for
standard files. Indeed, databases have a global schema (DDL text) that is immediately
translated into a physical schema. By contrast, each program includes only a partial
schema of standard files. At first glance, standard files are more tightly coupled
with the programs, and there are more structures and constraints buried in the
programs. Unfortunately, all the standard file tricks have been found in recent applications
that use "real" databases as well. The reasons can be numerous: the need to meet other
requirements such as reusability, genericity, simplicity, or efficiency; poor programming
practice; the application being a straightforward translation of a file-based legacy
system; etc.

Some kind of integration can also be necessary later to integrate different compo- 
nents of the schema that are represented by different structures (different record types 
for example) but represent the same concept. This can be discovered during the 
schema refinement process that will be presented in the next section. 



4 Schema Refinement

The main problem of the data structure extraction phase is to discover and to make
explicit, through the refinement process, the structures and constraints that were either
implicitly implemented or merely discarded during the development process. The variety
of implicit constructs can be very large. The main implicit structures and constraints
we are looking for are the following: record type and field disaggregation,
identifiers, foreign keys, functional dependencies, meaningful names, etc.

In this section, we analyse why we need different sources of information, we
describe the elicitation techniques, then we present a generic refinement methodology.
We conclude with an analysis of the automation of the refinement process.



4.1 The Information Sources 

To discover an implicit construct, the analyst cannot limit his or her analysis to one
information source. On the contrary, (s)he has to rely on all the possible information
sources. Those sources are, for example: application programs, data, HMI procedural
fragments, screen and report layouts, generic DMS code fragments (see footnote 1),
existing documentation, interviews, domain knowledge, operation environment knowledge, etc.

(S)he needs to analyze several of those sources because none of them contains all
the hints for all the constraints. For example, some constraints are not implemented in
the application program because they are guaranteed by some environmental properties
(the input data are always correct because they come from another fully reliable
application). On the other hand, spurious constraints can be discovered in the data (for
example, a field appears to be an identifier) because the set of data is too small.
Conversely, genuine constraints may appear not to hold because there is some erroneous data.

Many data structures and constraints that are not explicitly declared are coded,
among other places, in procedural sections of the programs. For this reason, one of the
most important information sources is the program source text.



4.2 The Refinement Techniques 

For each information source there exists a set of analysis techniques. These techniques 
can be very simple such as visual inspection of the data or more complex such as 
dynamic analysis of executing programs. 





a) COBOL program P

      FD CUSTOMER.
      01 CUS.
         02 CUS-NUM PIC 9(3).
         02 CUS-NAME PIC X(10).
         02 CUS-ORD PIC 9(2) OCCURS 10.
      01 ORDER PIC 9(3).

    1 ACCEPT CUS-NUM.
    2 READ CUS KEY IS CUS-NUM.
    3 MOVE 1 TO IND.
    4 MOVE 0 TO ORDER.
    5 PERFORM UNTIL IND=10
    6     ADD CUS-ORD(IND) TO ORDER
    7     ADD 1 TO IND.
    8 DISPLAY CUS-NAME.
    9 DISPLAY ORDER.

b) Slice of P with respect to CUS-NAME and line 8

      FD CUSTOMER.
      01 CUS.
         02 CUS-NUM PIC 9(3).
         02 CUS-NAME PIC X(10).

    1 ACCEPT CUS-NUM.
    2 READ CUS KEY IS CUS-NUM.
    8 DISPLAY CUS-NAME.

Fig. 3. Example of program slicing.



Program understanding is a very active area within the software engineering field.
It is the process of acquiring knowledge about an existing computer program [6]. The
main program understanding techniques are:

• searching the program source text for some patterns or cliches; 

• dependency graph: the dependency graph is a graph where each variable of a pro- 
gram is represented by a node and an arc represents a direct relation (assignment, 
comparison, etc.) between two variables; 

• program slicing: the slice of a program with respect to program point p and variable
x consists of all the program statements and predicates that might affect the value of x
at point p. This concept was originally discussed by M. Weiser [11]. Fig. 3.a is a
small COBOL program that asks for a customer number (CUS-NUM) and displays
the name of the customer (CUS-NAME) and the total amount of its orders (ORDER).
Fig. 3.b shows the slice that contains all the statements that contribute to displaying
the name of the customer, that is, the slice of P with respect to CUS-NAME at line 8.

1 Some DMS offer general functionality to enforce a large variety of constraints on the data.
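
To make the notion of a slice concrete, the following Python sketch recomputes the slice of Fig. 3 from hand-written def/use sets. It is a simplified, conservative approximation rather than Weiser's algorithm: it follows data and control dependences but ignores statement order, so on other programs it may keep more statements than a precise slicer would. The table PROGRAM_P, the choice of def/use sets for each COBOL statement, and the function name backward_slice are assumptions made for illustration.

```python
# Each statement of program P (Fig. 3.a):
#   line -> (defined variables, used variables, enclosing control line or None)
# The def/use sets are hand-written assumptions, not the output of a parser.
PROGRAM_P = {
    1: ({"CUS-NUM"}, set(), None),                    # ACCEPT CUS-NUM
    2: ({"CUS-NAME", "CUS-ORD"}, {"CUS-NUM"}, None),  # READ CUS KEY IS CUS-NUM
    3: ({"IND"}, set(), None),                        # MOVE 1 TO IND
    4: ({"ORDER"}, set(), None),                      # MOVE 0 TO ORDER
    5: (set(), {"IND"}, None),                        # PERFORM UNTIL IND=10
    6: ({"ORDER"}, {"ORDER", "CUS-ORD", "IND"}, 5),   # ADD CUS-ORD(IND) TO ORDER
    7: ({"IND"}, {"IND"}, 5),                         # ADD 1 TO IND
    8: (set(), {"CUS-NAME"}, None),                   # DISPLAY CUS-NAME
    9: (set(), {"ORDER"}, None),                      # DISPLAY ORDER
}


def backward_slice(program, line, variable):
    """Return the sorted lines that may affect `variable` at `line`."""
    in_slice = {line}
    wanted = {variable}        # variables whose definitions are still needed
    changed = True
    while changed:
        changed = False
        for number, (defs, uses, _parent) in program.items():
            if number in in_slice:
                continue
            defines_wanted = bool(defs & wanted)
            # A control statement is kept if it encloses a kept statement.
            controls_member = any(program[m][2] == number for m in in_slice)
            if defines_wanted or controls_member:
                in_slice.add(number)
                wanted |= uses
                changed = True
    return sorted(in_slice)


print(backward_slice(PROGRAM_P, 8, "CUS-NAME"))   # [1, 2, 8]
```

For the criterion (line 8, CUS-NAME) the result contains lines 1, 2 and 8, which are the procedural statements kept in Fig. 3.b; the loop that totals the orders is dropped because it does not influence CUS-NAME.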



    B                        A
      B1                       A1
      B2                       id: A1
      ref: B2  ----------->

Fig. 4. An elementary abstract schema including a foreign key. A1 is an identifier of table A and
B2 is a foreign key targeting table A.



Sometimes it can be useful to have different visualization techniques for the same 
information. A program can be visualized as a text or as a call graph. 

Data can be analyzed through queries that verify whether a constraint is present or 
not. For example, it is possible to write a query to verify that, in Fig. 4, B2 is a foreign 
key targeting table A: 

select count(*) from B where B2 not in (select A1 from A);

If the result is 0, then B2 could be a foreign key, otherwise it cannot. 
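
The query can be tried on a small invented data set. The sketch below uses Python's standard sqlite3 module with an in-memory database; the table contents and the helper name count_orphans are assumptions for illustration only.

```python
import sqlite3


def count_orphans(conn):
    """Count B rows whose B2 value has no matching A1 value in A."""
    query = "select count(*) from B where B2 not in (select A1 from A)"
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table A (A1 integer);
    create table B (B1 integer, B2 integer);
    insert into A values (1), (2), (3);
    insert into B values (10, 1), (11, 3), (12, 3);
    """
)
print(count_orphans(conn))   # 0: B2 could be a foreign key

conn.execute("insert into B values (13, 99)")
print(count_orphans(conn))   # 1: the value 99 does not occur in A.A1
```

One caveat when reusing the query on real data: under SQL's three-valued logic, rows of B with a null B2 are not counted, and a null value among the A1 values makes the count 0 regardless of the data, so null values should be examined separately.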

Fig. 5. The refinement method.



4.3 Schema Refinement Method 

Due to the large amount of information to manipulate, attempting an exhaustive search
for all the imaginable constraints is unrealistic. We need a methodology that guides us
in our investigation of constraints. This methodology will reduce the search space to the
plausible constraints. For example, it is not realistic to query the data to check whether
each field combination of the database is an identifier. We have to decide which fields
are potential identifiers depending on their name (containing keywords such as "code",
"id", "number", etc.), their structure (mandatory), their position in the record (the first
field of the record), etc.
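
The three criteria just listed (name, structure, position) can be combined into a simple filter that selects the fields worth checking against the data. The Python sketch below is only one possible reading: the scoring scheme, the threshold of two satisfied criteria, and the function names are assumptions, and the sample record reuses the ORD fields of Fig. 2.

```python
import re

KEYWORDS = ("code", "id", "number")   # keywords quoted in the text


def identifier_score(field_name, mandatory, position):
    """Count how many of the three heuristics a field satisfies (0 to 3)."""
    tokens = re.split(r"[^a-z0-9]+", field_name.lower())
    score = 0
    if any(token in KEYWORDS for token in tokens):
        score += 1                    # name contains a typical keyword
    if mandatory:
        score += 1                    # structure: the field is mandatory
    if position == 0:
        score += 1                    # position: first field of the record
    return score


def candidate_identifiers(fields, threshold=2):
    """fields: list of (name, mandatory) pairs in record order."""
    return [name for position, (name, mandatory) in enumerate(fields)
            if identifier_score(name, mandatory, position) >= threshold]


def is_unique(rows, field):
    """Data check, run only for candidates: no value occurs twice."""
    values = [row[field] for row in rows]
    return len(values) == len(set(values))


order_fields = [("ORD-CODE", True), ("ORD-CUS", True), ("ORD-DETAIL", False)]
print(candidate_identifiers(order_fields))   # ['ORD-CODE']
```

Only the fields returned by candidate_identifiers would then be submitted to the costlier uniqueness check on the data, which is the reduction of the search space argued for above.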

Fig. 5 sketches a schema refinement method, where rectangles represent processes,
ellipses represent the different products (schema, data passed from one process to the
other, ...), and diamond shapes represent decision points. The execution flow is
materialized by bold arrows, and normal arrows represent product usage. The schema analysis
process analyses the augmented physical schema to find missing constraints or
constructs, called hypotheses. The domain knowledge and the database design knowledge
are also used to discover missing constraints during the schema analysis. Then we
analyse the other sources of information to validate each hypothesis (hypothesis
validation). If a hypothesis is validated, the schema enhancement process enriches the
schema with the validated constraint. We iterate until no new hypotheses are generated
by the schema analysis.
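
The control flow of this method can be summarized as a short loop. The following Python sketch is a schematic reading of the description above, not the authors' tool: the three callbacks (analyse, validate, enhance), the representation of a schema as a set of constraint strings, and the toy hypotheses are all assumptions.

```python
def refine_schema(schema, analyse, validate, enhance):
    """Iterate analysis, validation and enhancement until no new hypothesis.

    analyse(schema)    -> iterable of hypotheses (hashable values)
    validate(h)        -> True if the other information sources support h
    enhance(schema, h) -> new schema enriched with the validated construct
    """
    examined = set()
    while True:
        new = [h for h in analyse(schema) if h not in examined]
        if not new:
            return schema
        for hypothesis in new:
            examined.add(hypothesis)
            if validate(hypothesis):
                schema = enhance(schema, hypothesis)


# Toy instantiation: the schema is a set of constraint strings.
def toy_analyse(schema):
    proposals = {"id(A.A1)", "fk(B.B2 -> A.A1)", "id(B.B2)"}
    # A foreign key is only proposed once its target identifier is known.
    if "id(A.A1)" not in schema:
        proposals.discard("fk(B.B2 -> A.A1)")
    return proposals - schema


def toy_validate(hypothesis):
    return hypothesis != "id(B.B2)"      # pretend the data contradict this one


def toy_enhance(schema, hypothesis):
    return schema | {hypothesis}


refined = refine_schema(frozenset(), toy_analyse, toy_validate, toy_enhance)
print(sorted(refined))
```

The toy run needs a second iteration because the foreign key hypothesis only appears after the identifier of A has been added, which illustrates why the method iterates instead of making a single pass. The loop terminates as long as the analysis can only produce a finite number of distinct hypotheses.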

This method can be instantiated for each DBRE project. Schema analysis depends
on the constraints we are looking for, and hypothesis validation changes according to the
information source and the constraint we want to validate. Indeed, database reverse
engineering is basically a loosely structured learning process which varies largely from
one project to another. We can sketch, as an example, the following principles for
foreign key elicitation that can apply to relational databases managed by early RDBMS,
in which no keys were explicitly declared. This example is based on the schema of
Fig. 4.



Phase                 Heuristics            Short description
--------------------  --------------------  ---------------------------------------------------
Schema analysis       name analysis         Name of column B.B2 suggests a table, or an
                                            external id, or includes keywords such as ref, ...
                      name analysis         Select table A based on the name of B2
                      domain knowledge      Objects described by B are known to have some
                                            relation with those described by A
                      domain knowledge      Find a table describing objects which are known
                                            to have some relation with those described by B
                      technical constructs  Search the schema for a candidate referenced
                                            table and id (with same type and length)

Hypothesis            B.B2 » A.A1           Field B2 of B is a foreign key to identifier A1
                                            of A

Hypothesis proving    technical constructs  There is an index on B2
                      technical constructs  B.B2 and A.A1 are in the same cluster
                      dataflow analysis     B.B2 and A.A1 are in the same dataflow graph
                                            fragment
                      usage pattern         A.A1 values are used to select B rows with same
                                            B2 values
                      usage pattern         B.B2 values are used to select A row with same
                                            A1 value
                      usage pattern         A B row is stored only if there is a matching
                                            A row
                      usage pattern         When an A row is deleted, B rows with B2 values
                                            equal to A.A1 are deleted as well
                      usage pattern         There are views based on a join with
                                            B.B2 = A.A1

Hypothesis disproving data analysis         Prove that some B.B2 values are not in A.A1
                                            value set

It is good practice to apply as many heuristics as we can, because the success of a
heuristic does not mean that the hypothesis is verified. Conversely, if a heuristic fails,
the hypothesis is not necessarily disproved. We can formalize this as follows:

• we formulate hypothesis h on the existence of an implicit construct C;
  so far, h is stated with probability $p_0 < 1$;

• we apply heuristic H; h is now stated with probability $p_1$;

• if H succeeds, $p_1 > p_0$: the existence of C is more certain, though $p_1 < 1$.
  For instance, in the example above, if there is an index on B2, it is one more piece of
  evidence that B2 is a foreign key to the identifier A1 of A, but we are not yet
  completely certain.

• if H fails, one of three interpretations can hold:

  • $p_1 = 0$: h is disproved, so that we accept that C does not exist.
    For example, if half of the values of B.B2 are not in the A.A1 value set, we can say
    that there is no foreign key from B.B2 to A.A1.

  • $p_1 < p_0$: h is less certain, but could still be proved through other heuristics.
    For example, if there is only one value of B.B2 (out of one million) that is not in
    the A.A1 value set, we cannot conclude that there is no foreign key; it is perhaps an
    error in the data.

  • $p_1 = p_0$: H does not contribute to the search.
    For example, we may fail to find B.B2 and A.A1 in the same dataflow graph fragment
    simply because we have chosen a program that does not use the foreign key (a program
    that only manipulates B and not A).
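
The data analysis heuristic used in the two examples above can be sketched as follows. This is only an illustration: the function names and the tolerance threshold are assumptions, and deciding where "an error in the data" ends and "no foreign key" begins remains the analyst's responsibility.

```python
def orphan_ratio(b2_values, a1_values):
    """Fraction of non-null B.B2 values that have no matching A.A1 value."""
    targets = set(a1_values)
    present = [v for v in b2_values if v is not None]
    if not present:
        return None  # no data: the heuristic gives no evidence either way
    orphans = sum(1 for v in present if v not in targets)
    return orphans / len(present)


def interpret(ratio, tolerance=0.001):
    """Map the ratio to an interpretation; the tolerance is an arbitrary assumption."""
    if ratio is None:
        return "no contribution"   # p1 = p0
    if ratio == 0:
        return "supported"         # p1 > p0, but still p1 < 1
    if ratio <= tolerance:
        return "less certain"      # p1 < p0, possibly errors in the data
    return "disproved"             # p1 = 0
```

A result of "supported" does not prove the foreign key: the data may satisfy the inclusion by accident, which is why other heuristics are applied as well.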

Experience has shown that:

• Analysing all the information sources generally proves too expensive, so that the
  analyst has to determine which sources to analyse.

• A hypothesis cannot be proved by heuristics alone; it is up to the analyst to decide
  when (s)he is convinced that the hypothesis is validated.

• This method assumes that all the information is reliable. The value of a heuristic
  applied to unreliable information (corrupted data, programming errors) remains an open
  question.

• While validating a hypothesis, we can discover other constraints that must be added to
  the schema. For example, when analyzing a program slice computed to verify that a field
  is a foreign key, other constraints about this field can be found, because the slice
  contains all the program instructions that influence the value of the field.



4.4 How to Decide that Refinement is Completed 

The goal of the reverse engineering process has a great influence on the output of the
refinement process and on its ending condition. The simplest DBRE project may aim only at
recovering the list of all the records with their fields; all the other constraints are
then of no use. Such an inventory of the data used by an application can be useful to
prepare another reverse engineering or maintenance project (such as Y2K conversion). In
this kind of project, the logical schema is the final product of DBRE.

At the other extreme, we can try to recover all the possible constraints to obtain a
complete view of the database. This can be necessary in a migration project where we want
to convert a collection of standard files into a relational database.

It is the analyst’s responsibility to decide that the schema is complete and that all the
needed constraints have been extracted.



4.5 Process Automation 

The schema refinement process basically is a decisional activity that cannot be fully 
automated. Many analysis techniques are not intended to locate and find implicit con- 
structs, but rather contribute to the discovery of these constructs by focusing the ana- 
lyst’s attention on special text or structural patterns. In short, they narrow the search 
scope. It is up to the analyst to decide if the constraint that is looked for is present or 
not. For example, computing a program slice provides a small set of statements with a 
high density of interesting patterns according to the construct that is searched for (typ- 
ically a foreign key). This small program segment must then be examined visually to 
check whether traces of the construct are present or not. 

Another reason why full automation cannot be reached is that each DBRE project is
different. The sources of information, the underlying DBMS or the coding rules can all be
different and even incompatible.

Despite these restrictions, automation is highly desirable for large projects in which
huge volumes of information have to be explored. Portfolios of more than 10,000 programs
and databases of more than 500 files/tables are not uncommon¹.

This automation can be of different kinds: 

• Some processes can be fully automated. For example, during the schema analysis
  process, it is possible to have a tool that detects all the possible foreign keys that
  meet some matching rules (the target is an identifier and the candidate foreign keys
  have the same length and type as their target).

• Other processes can be partially automated with some interaction with the analyst.
  For example, we can use the dataflow diagram to detect field decompositions
  automatically. The analyst is asked to resolve conflicts (two different decompositions
  for the same field).

• We can define tools that generate reports which the analyst examines to validate the
  existence of a constraint. For example, we can generate a report with all the fields
  that contain the keywords “id” or “code” and are the first field of their record. The
  analyst must decide which fields are candidate identifiers.
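
The first kind of automation listed above, detecting possible foreign keys through matching rules, can be sketched as follows. The `Column` descriptor and the sample names used in the note below are hypothetical; this is not the DB-MAIN foreign key assistant, only an illustration of the type and length rule.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """Hypothetical column description extracted from a physical schema."""
    table: str
    name: str
    type: str
    length: int
    is_identifier: bool


def foreign_key_candidates(columns):
    """Pair each identifier with every other column of the same type and length.

    Returns (origin, target) pairs. These are hypotheses only: each pair still
    has to be validated against other information sources.
    """
    targets = [c for c in columns if c.is_identifier]
    pairs = []
    for target in targets:
        for origin in columns:
            same_column = (origin.table, origin.name) == (target.table, target.name)
            same_shape = (origin.type, origin.length) == (target.type, target.length)
            if not same_column and same_shape:
                pairs.append((origin, target))
    return pairs
```

With a customer identifier of type char(8), an order number of type num(6), and two further order columns of types char(8) and num(6), the rule proposes two pairs, one of which (a date matched with the order number) is a false positive. This is why such a tool narrows the search but does not replace validation.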



1 The complexity of DBRE projects is between $O(N)$ and $O(N^2)$, where N is the number of
entity types of the schema. Indeed, each new entity type can be related with all the entity
types of the schema. The analysis effort to process a 500-table database can be up to 100
times greater than for a 50-table database.



5 The DBRE Functions of DB-MAIN

Several industrial projects have proved that powerful techniques and tools are 
essential to support DBRE, especially the data structure extraction process, in realistic 
size projects. These tools must be integrated and their results recorded in a common 
repository. In addition, the tools need to be easily extensible and customizable to fit the 
analyst's exact needs. 

DB-MAIN is a general-purpose database engineering CASE environment that offers
sophisticated reverse engineering toolsets. DB-MAIN is one of the results of an R&D project
started in 1993 by the database team of the computer science department of the University
of Namur (Belgium). Its purpose is to help the analyst in the design, reverse engineering,
migration, maintenance and evolution of database applications.

DB-MAIN offers the usual CASE functions, such as database schema creation, management,
visualisation, validation, transformation, and code and report generation. It also includes
a programming language (Voyager2) that can manipulate the objects of the repository and
allows users to develop their own functions. More details can be found in [3] and [5].

DB-MAIN also offers some functions that are specific to the data structure extraction
process [6]. The extractors automatically extract the data structures declared in a source
text. Extractors read the declaration part of the source text and create corresponding
abstractions in the repository. The foreign key assistant is used to find the possible
foreign keys of a schema. Given a group of fields that is the origin (or the target) of a
foreign key, it searches a schema for all the groups of fields that can be the target (or
the origin) of the first group. The search is based on a combination of matching criteria
such as the group type, the length, the type and the name of the constructs.

Other reverse engineering functions use three specific program understanding processors.

• A pattern matching engine searches a source text for a definite pattern. Patterns are
  defined in a powerful pattern description language (PDL), through which hierarchical
  patterns can be defined.

• DB-MAIN offers a variable dependency graph tool. The dependency graph itself is
  displayed in context: the user selects a variable, then all the occurrences of this
  variable, and of all the variables connected to it in the dependency graph, are coloured
  in the source text, both in the declaration and in the procedural sections. Though a
  graphical presentation could be thought to be more elegant and more abstract, experience
  has taught us that the source code itself gives much lateral information, such as
  comments, layout and surrounding statements.

• The program slicing tool computes the program slice with respect to the selected 
line of the source text and one of the variables, or component thereof, referenced at 
that line. 

One of the great lessons we painfully learned is that there are no two similar DBRE
projects. Hence the need for easily programmable, extensible and customizable tools.
The DB-MAIN (meta-)CASE tool is now a mature environment that offers powerful
program understanding tools dedicated, among others, to database reverse engineering,
as well as sophisticated features to extend its repository and its functions.



6 Conclusion

In this paper, we have presented a generic methodology for the data structures extrac- 
tion process. We have shown that the schema refinement is a difficult task because it 
cannot be fully automated and it can be very different from one project to another. 

The role of the analyst is very important. Except in simple projects, (s)he needs to be a
skilled person, competent in the application domain, in database design methodology, in
DBMSs and in programming languages (usually old ones).

One of the major objectives of the DB-MAIN project is the methodological and tool support
for database reverse engineering processes. We quickly learned that we needed powerful
program analysis techniques and their supporting tools, such as those that have been
developed in the program understanding realm. We integrated these techniques into a highly
generic DBRE methodology, and we developed specific analyzers to include in the DB-MAIN
CASE tool.

An education version is available at no charge for non-profit institutions
(http://www.info.fundp.ac.be/~dbm).



Abstract. This paper addresses the issue of documenting an existing
legacy database by mining out its characteristics and deriving the
corresponding entity-relationship model. We developed algorithms to identify
candidate keys of all relations in the relational schema, to locate the
occurrences of a given candidate key as foreign key in any existing
relation, and to decide on the appropriate links (relationships) between the
given relations. Based on this analysis, we draw a graph that
corresponds to the entity-relationship diagram and predicts all possible
relationships between relations in the existing relational schema. Finally,
we derive the cardinality of each link in the graph.



1 A general overview 

In the absence of documentation, the understanding of a database design is 
easily lost when the developers disperse. This way, it becomes very difficult (if 
not impossible) to maintain and adjust an existing legacy database schema, and 
hence reverse engineering is necessary. The reverse engineering process derives 
the conceptual model from the conventional schema. The goal is to extract and 
know as much necessary information as possible about the conceptual model that 
led to the legacy database to be documented. This requires a detailed study of 
the existing legacy database. 

Our approach is summarized in the following steps. We start with the conventional
database and derive the rules required to achieve the target. There are four
choices to consider. The first choice relies on both the existing database catalogue
(meta-data) and an expert in the conventional system to derive the rules. In
the second choice, only the database catalogue present in the conventional system is
utilized to derive the rules; the success of the reverse engineering process depends
on the correctness and completeness of the data dictionary. The third choice
relies on the knowledge of an expert in the conventional system to feed
the system with the required rules; the success of the reverse engineering process
depends on how much detail of the legacy database the expert is familiar with.
Finally, when neither the data dictionary nor an expert is available, the only choice
is to mine the rules out of the database contents. During the mining process, the
characteristics of the conventional database are investigated in order to derive
the necessary rules. This paper mainly handles the fourth case. Assuming that
the original developers were professionals who thoroughly understood the
problem and designed a model that preserves database consistency and integrity, the
degree of success depends on the effort put into the mining process.

Regardless of which of the four choices is applicable, we need to derive a 
table that contains attribute(s) in the primary key of each relation within the 
given conventional relational database. Particularly in the fourth case, it is nec- 
essary to check all possible candidate key(s) of a given relation in order to decide 
on its primary key. The latter decision depends on comparing the domains and 
values of attribute(s) in a candidate key with other attributes in all existing 
relations. We try to find out which candidate key of a given relation is present 
as foreign key in other relations. This process leads to a table that shows the 
correspondence between attributes of different relations. The latter table is nec- 
essary and sufficient to draw a graph that summarizes all relationships between 
the relations present in the relational schema. This graph corresponds to the 
entity-relationship diagram (ERD). Nodes in the graph are relations and the 
number of edges connecting two nodes depends on the number of occurrences of 
the primary key of one relation as foreign key in the other; one edge corresponds 
to each occurrence. 

The rest of the paper is organized as follows. Related work is discussed in 
Section 2. The steps to be followed in extracting the information necessary to 
document a given legacy database are explained in Section 3; also included in 
Section 3 is an example relational schema to be referenced throughout the pa- 
per where illustrating examples are necessary. The graph that corresponds to 
the entity-relationship diagram is described in Section 4. Section 5 includes the 
conclusions and future work. 



2 Related Work 



Reverse engineering is considered to be the most vital stage in the whole
re-engineering process. As described in the literature [1, 6, 7, 9, 11, 14, 17-19], much
of the research effort in the area of re-engineering has concentrated on reverse
engineering. Most of the research is based on using the relational schema as the basic
input and aims at extracting the semantic information by an exhaustive
analysis of individual relations in the schema together with their key and non-key
attributes. It is assumed that the schema input is at least in the third normal
form; this easily allows the methodologies to identify candidate classes. The work
described in [16] proposes three basic transformations that are repeatedly
applied to produce the re-engineered object-oriented schema. The main objective
is to preserve a one-to-one correspondence between relations and object types.
Two other similar approaches are described in [9, 10] and [13]; both approaches
handle the establishment of generalization hierarchies. The re-engineering
activities at CIGNA Corporation are described in [8]. It covers all aspects of the
undertaken system re-engineering process and also identifies the basic lessons
learned from the first five years of the process.

Another group of researchers depend on query languages to extract some of
the semantic information stored in a relational database [6, 17]. The approach
discussed in [6] extracts a conceptual schema by analyzing equi-join statements.
The join conditions are used to represent edges in the schema and the use of
the keyword distinct in a query implies non-unique values for the following
attribute(s); these attributes are eliminated while deciding on the key. The work
described in [17] starts from the database schema gaining knowledge about
relation names and the attributes. After that, the semantic information for the
relevant relations is extracted from the available queries by investigating auto-join,
set operations and where-by clause statements as well as the equi-join
statements.

The approach described in [7, 18] utilizes data instances to derive some of the 
semantics of the database application. It is actually an informal process requir- 
ing a lot of involvement from the user. Primary keys are identified by looking for 
unique indices and foreign keys are utilized to derive generalization and aggre- 
gation relationships. The approach described in [20] handles the re-engineering 
process by applying two steps. The first details the classification of relations to 
reflect the different object-oriented constructs, whereas the second step consists 
of applying a set of rules to generate these constructs based on the different in- 
formation contained in local databases. Finally, some other approaches deal with 
the design of re-engineering architectures for database applications [15, 19]. The
work proposed in [15] is particularly useful to understand the requirements for 
reverse engineering support. It only details a generic process model and the main 
schema transformation useful for the re-engineering process without proposing 
any specific algorithm. The work presented in [19] is based on the identification 
of schema, primary keys, SQL and procedural indicators. 

3 Mining out the Necessary Information 

In this section, we present our approach to extract from the legacy database the 
information necessary to continue with the documentation process. We do not 
assume any prior knowledge about the data analysis phase. First, consider the 
relational schema given next in Example 1. Where deemed necessary, we will 
refer to Example 1 to illustrate the steps of the presented approach.

Example 1 (Example Relational Schema)

Person(SSN, name, city, age, sex, SpouseSSN, CountryName)
Country(Name, area, population)
Student(StudentID, gpa, PSSN, DName)
Staff(StaffID, salary, PSSN, DeptName)
ResearchAssistant(StudentID, StaffID)
Course(Code, title, credits)
Prerequisite(CourseCode, PreqCode)
Takes(Code, StudentID, grade)
Department(Name, HeadID)
Secretary(SSN, words/minute, DName) □

The success of the documentation process depends directly on the depth
of understanding of the existing legacy database. The better the design is
understood, the more information can be extracted and hence the more successful
the whole process is. The information that we are looking for is summarized by
the following tables:

CandidateKeys(relation name, attribute name, candidate key#)
PrimaryKeys(relation name, attribute name)
ForeignKeys(PK relation, PK attribute, FK relation, FK attribute, Link#)

The first table, CandidateKeys, includes all possible candidate keys of the 
relations present in the relational schema. Attributes that constitute one candi- 
date key for a given relation are assigned the same sequence number. This way, 
it is possible to keep track of having the same attribute participating in more 
than one candidate key. Present in PrimaryKeys are only attributes that con- 
stitute primary keys of the given relations. Finally, ForeignKeys table includes 
all pairs of attributes such that the first attribute is part of a candidate key in a 
certain relation and the second attribute is part of a foreign key, a representative 
of the first attribute within any of the relations. Link# is there to differentiate 
between different foreign keys in the same relation. Foreign keys are numbered 
such that all attributes in the same foreign key are assigned the same number. 

Recall that the source of the information to be included in the
above-mentioned tables might be the database catalogue and/or an expert who knows
the details of the existing legacy database. The rest of this section is
dedicated to the case where both sources are missing and hence the information is
extracted by analyzing the contents of the existing relations.

The first table to derive is CandidateKeys, based on which the other two
tables can be derived. To derive candidate keys of any relation R in the relational
schema, elements of the powerset of the set of attributes in relation R, denoted
P(R), are considered by Algorithm 1, given next.

Algorithm 1 (Find Candidate Keys)

Input: A table R and P(R), the powerset of the set of attributes in R.
Output: All possible candidate keys of R are added to CandidateKeys table.
Steps:

Consider elements of P(R) in ascending order, according to their number
of attributes;
Let CandidateKey# = 1;
For every s in P(R) do
    /* Join R with itself based on the equality of each attribute in s with itself. */
    Let T = Select * from R X, R Y where X.s = Y.s;
    If size(R) = size(T) then
        For i := 1 to n do /* where n is the number of attributes in s */
            Add (R, s_i, CandidateKey#) to CandidateKeys;
            /* Here s_i denotes element (attribute) i in set s. */
        EndFor
        Delete from P(R) any element that includes s;
        CandidateKey# = CandidateKey# + 1;
    EndIf
EndFor
EndAlgorithm. □

Algorithm 1 determines all possible candidate keys of the input relation R. 
The basic idea is to check whether some attributes have unique values in each 
tuple of the input relation R. A set of attributes is accepted as a candidate key 
of R if and only if each tuple in R joins only with itself based on the equality of 
these attributes. In our implementation, the performance of Algorithm 1 can be 
improved by considering in the input only a sample subset of tuples in R. The 
larger the utilized sample, the better is the obtained result. 
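
The following minimal sketch illustrates Algorithm 1 on in-memory rows. It is not the implementation described in this paper: it assumes rows are given as dictionaries without null values, and it replaces the self-join test by an equivalent one, since size(T) = size(R) holds exactly when the projection of R on s contains no duplicate.

```python
from itertools import combinations


def find_candidate_keys(rows, attributes):
    """Return the minimal attribute sets whose values are unique across rows.

    rows: list of dicts (one per tuple); attributes: list of attribute names.
    Assumes no null values, which would need separate treatment.
    """
    keys = []
    # Ascending subset size mirrors the ordering of P(R) in Algorithm 1.
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            # Supersets of a key already found are skipped; this plays the
            # role of "Delete from P(R) any element that includes s".
            if any(set(key) <= set(subset) for key in keys):
                continue
            projection = {tuple(row[a] for a in subset) for row in rows}
            # Uniqueness test, equivalent to size(T) = size(R) for the self-join.
            if len(projection) == len(rows):
                keys.append(subset)
    return keys
```

Each returned tuple corresponds to one value of CandidateKey#. Like the sampling mentioned above, the result is only as reliable as the data it is computed from: a set of attributes can be unique in the available tuples by accident.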

Example 2 (Find Candidate Keys) 

Let us execute Algorithm 1 to find candidate keys of the Student relation. 

P(Student) = {{StudentID}, {gpa}, {PSSN}, {Dname}, {StudentID, gpa},
              {StudentID, PSSN}, {StudentID, Dname}, {gpa, PSSN},
              {gpa, Dname}, {PSSN, Dname}, {StudentID, gpa, PSSN},
              {StudentID, gpa, Dname}, {StudentID, PSSN, Dname},
              {gpa, PSSN, Dname}, {StudentID, gpa, PSSN, Dname}}

Elements of P(Student) are considered in ascending order according to their
number of attributes. For instance, to check whether StudentID is a candidate
key, the following SQL statement is executed to join the Student relation with
itself.

T = Select * from Student X, Student Y where X.StudentID = Y.StudentID;

As a result of this query, StudentID is recognized to be a candidate key because
size(T) = size(Student). Based on this, all elements of P(Student) that contain
StudentID as an element are excluded from consideration in the rest of the
search for candidate keys of the Student relation. The remaining elements of
P(Student) are checked in the same way, and only PSSN is proved to be
another candidate key for the Student relation. □

Applying Algorithm 1 to the relational schema introduced in Example 1
returns the candidate keys in Table 1(a). The third column of Table 1(a)
shows that some relations have more than one candidate key.
In general, a relation may have a set of candidate keys, one of which is chosen
as the primary key. All relations, except Takes and Prerequisite, have candidate
keys that each consist of a single attribute. After all candidate keys are located, it is
time to investigate their presence in other relations as foreign keys. To achieve
that, we implemented Algorithm 2, which finds out, for each relation, which of the
candidate keys returned by Algorithm 1 is present in other relations as foreign
key(s). This step is necessary to decide on the links between relations because
the rest of our work is based on these links.

(a) Candidate Keys

Relation Name       Attribute Name   Candidate Key #
------------------  ---------------  ---------------
Person              SSN              1
Country             Name             1
Student             StudentID        1
Student             PSSN             2
Staff               StaffID          1
Staff               PSSN             2
ResearchAssistant   StudentID        1
ResearchAssistant   StaffID          2
Course              Code             1
Prerequisite        CourseCode       1
Prerequisite        PreqCode         1
Takes               Code             1
Takes               StudentID        1
Department          Name             1
Department          HeadID           2
Secretary           PSSN             1

(b) Primary Keys

Relation Name       Attribute Name
------------------  ---------------
Person              SSN
Country             Name
Student             StudentID
Staff               StaffID
ResearchAssistant   StudentID
Course              Code
Prerequisite        CourseCode
Prerequisite        PreqCode
Takes               Code
Takes               StudentID
Department          Name
Secretary           PSSN

Table 1. a) Candidate Keys: Candidate keys of all the relations in Example 1
b) Primary Keys: Primary keys of all the relations in Example 1



Algorithm 2 (Find Candidate Foreign Keys)

Input: A relation R, the AttributeList table and all possible candidate keys of
table R.
Output: Attributes in foreign key(s) that represent a candidate key of R in any
relation, including R, are added to the ForeignKeys table.
Steps:

Consider candidate keys of R in ascending order, according to their number
of attributes;
For every candidate key ck of R do
    Consider relations in the relational schema in ascending order, according to
    their number of attributes;
    For every relation S in the relational schema do
        Consider from P(S) only elements s that contain the same number of
        attributes as candidate key ck and the same underlying domains;
        Let Link# = 1;
        For every s do
            Let P1 = Select s from R, S where ck = s;
            Let P2 = Select s from S; /* That is, project S on attributes in s. */
            Let T = P2 - P1;
            If size(T) = 0 and size(P2) > 0 then
                /* Attribute(s) in ck correspond to attribute(s) in s, i.e., ck is a
                   candidate key and s is a foreign key. */
                For i := 1 to n do /* where n is the number of attributes in ck */
                    Add the tuples (R, ck_i, S, s_i, Link#) to ForeignKeys table;
                    /* Here ck_i and s_i refer to corresponding attributes in ck and s,
                       respectively. */
                EndFor
                Link# = Link# + 1;
            EndIf
        EndFor
    EndFor
EndFor
EndAlgorithm. □

Algorithm 2 decides on the presence of candidate keys of a given relation R
as foreign keys within any relation in the relational schema, including relation R
itself. Relation R is considered for the possibility of having its candidate key
present within its attributes as foreign key. This way, cycles, i.e., self references
within the model, are detected.
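
A minimal in-memory sketch of the inclusion test at the heart of Algorithm 2 is given below. It is an illustration under several assumptions, not the implementation described here: tables are dictionaries of row lists, the domain check and the ordering of relations are omitted, null values are ignored, and a candidate key is never matched against itself.

```python
from itertools import permutations


def find_foreign_keys(tables, candidate_keys):
    """Return (PK relation, PK attribute, FK relation, FK attribute, Link#) tuples.

    tables: {relation name: [row dict, ...]}
    candidate_keys: {relation name: [tuple of attribute names, ...]}
    """
    foreign_keys = []
    for r_name, keys in candidate_keys.items():
        for ck in sorted(keys, key=len):
            key_values = {tuple(row[a] for a in ck) for row in tables[r_name]}
            for s_name, s_rows in tables.items():
                if not s_rows:
                    continue
                link = 1
                # Ordered attribute lists pair each attribute of s with one of ck.
                for s in permutations(list(s_rows[0]), len(ck)):
                    if s_name == r_name and s == ck:
                        continue  # assumption: skip the key matching itself
                    # P2: projection of S on s (assumption: nulls are dropped).
                    p2 = {tuple(row[a] for a in s) for row in s_rows}
                    p2 = {t for t in p2 if None not in t}
                    # T = P2 - P1 is empty exactly when P2 is included in the
                    # set of key values; P2 must also be non-empty.
                    if p2 and p2 <= key_values:
                        for ck_attr, s_attr in zip(ck, s):
                            foreign_keys.append((r_name, ck_attr, s_name, s_attr, link))
                        link += 1
    return foreign_keys
```

On a Course table and a Prerequisite table whose two columns both contain course codes, the sketch reports two links numbered 1 and 2, as in the rows of the ForeignKeys table for Prerequisite. Without a domain check, accidental inclusions (for example between small integer columns) would also be reported.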





Candidate Key attributes            Foreign Key attributes
Relation Name    Attribute Name     Relation Name       Attribute Name   Link#
---------------  -----------------  ------------------  ---------------  -----
Person           SSN                Student             PSSN             1
Person           SSN                Staff               PSSN             1
Person           SSN                Secretary           PSSN             1
Person           SSN                Person              SpouseSSN        1
Country          Name               Person              CountryName      1
Student          StudentID          ResearchAssistant   StudentID        1
Student          StudentID          Takes               StudentID        1
Staff            StaffID            ResearchAssistant   StaffID          1
Staff            StaffID            Department          HeadID           1
Course           Code               Prerequisite        CourseCode       1
Course           Code               Prerequisite        PreqCode         2
Course           Code               Takes               Code             1
Department       Name               Student             Dname            1
Department       Name               Staff               DeptName         1
Department       Name               Secretary           DName            1

Table 2. ForeignKeys: Candidate keys and their corresponding foreign keys



It is necessary for Algorithm 2 to detect and locate all occurrences of a certain 
candidate key as foreign key within a given relation because it is possible to have 
two relations connected by more than one relationship. Multiple foreign keys 
within the same relation are numbered to differentiate attributes participating 
in each foreign key. One important point to consider here is that foreign keys 
within the same relation must be pairwise disjoint. This is reflected into the 
implementation of Algorithm 2 to improve its performance by excluding from 
consideration, as candidates for foreign keys, any set of attributes a subset of 
which is part of one of the already detected foreign keys. Table 2 is ForeignKeys 
obtained as a result of executing Algorithm 2 with the relational schema in 
Example 1 as input. 

The information included in ForeignKeys table is sufficient to proceed with
the documentation process. Using this information, all possible links between
relations present in the given relational schema can be derived. The links form
the basis for the remaining steps that lead to the required ERD.

The PrimaryKeys table is derived as a by-product of Algorithm 2. For relations
that have only one candidate key, that candidate key is added to PrimaryKeys.
For relations that have multiple candidate keys, the primary key is selected
to be the candidate key that appears in the ForeignKeys table. Table 1(b)
includes the primary keys of all relations in the relational schema in Example 1;
notice that Table 1(b) is a subset of the first two columns of Table 1(a).
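
The selection rule just described can be sketched as follows. The function name is hypothetical, and the fallback used when no candidate key of a relation is referenced (taking the lowest-numbered key) is an assumption of this sketch; the text does not state how such cases are resolved.

```python
def select_primary_keys(candidate_keys, foreign_keys):
    """Choose one primary key per relation.

    candidate_keys: rows (relation, attribute, candidate key #)
    foreign_keys: rows (PK relation, PK attribute, FK relation, FK attribute, Link#)
    Returns {relation: [attribute, ...]}.
    """
    referenced = {(pk_rel, pk_attr) for pk_rel, pk_attr, *_ in foreign_keys}
    by_relation = {}
    for rel, attr, num in candidate_keys:
        by_relation.setdefault(rel, {}).setdefault(num, []).append(attr)
    primary = {}
    for rel, keys in by_relation.items():
        if len(keys) == 1:
            # A single candidate key is the primary key.
            primary[rel] = next(iter(keys.values()))
            continue
        # Prefer a candidate key that is referenced in ForeignKeys.
        hits = [attrs for attrs in keys.values()
                if any((rel, a) in referenced for a in attrs)]
        # Assumption: with no referenced key, fall back to the lowest number.
        primary[rel] = hits[0] if hits else keys[min(keys)]
    return primary
```

When several candidate keys of the same relation are referenced, the sketch keeps the first one it meets; in practice such ties are better left to the analyst.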

4 The Relational Intermediate Direct Graph 

In this section, we concentrate on how to benefit from the information present 
in the ForeignKeys table to decide on the links between the relations present 
in a given relational schema. For this purpose, a graph that includes all possi- 
ble relationships is generated. What we call the Relational Intermediate Direct 
Graph (RIDG) is constructed based on the information present in ForeignKeys. 
Nodes in RIDG are relations and two nodes are connected by a link to show 
that a foreign key in the relation that corresponds to the first node represents 
the primary key of the relation that corresponds to the second node. Nodes and 
links are represented by small rectangles and directed arrows, respectively. More 
formal details related to RIDG are included in Definition 1, given next. 

Definition 1 (Relational Intermediate Direct Graph) 

Given a relational schema and the related ForeignKeys table, the corresponding 
RIDG is a directed graph defined based on the information present in 
ForeignKeys. The set of vertices V and the set of edges E in RIDG are 
determined by: 

1. V includes every relation R that appears in ForeignKeys. 

2. Let R1 and R2 be two nodes in V; an edge (R2, R1) is added to E if and 
only if there exists a tuple (R1, x, R2, y, i) in the ForeignKeys table, where x is 
an attribute in R1, y is an attribute in R2, and i is an integer. The edge 
(R2, R1) is directed from R2 to R1, and only one edge (R2, R1) is added to 
E for every primary key and foreign key pair. In other words, the number of 
links from R2 to R1 depends on the number of occurrences of the primary 
key of R1 as foreign key in R2, and not on the number of attributes in a key, 
i.e., there is only one link (R2, R1) for each value of i. □ 
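
Definition 1 can be read as the following sketch, which builds V and E from ForeignKeys tuples of the form (R1, x, R2, y, i). Representing each edge as a triple (R2, R1, i) is an assumption made here so that parallel links stay distinct. The Prerequisite and Course relations come from the running example; the attribute names and the Enrollment and Offering relations are hypothetical.

```python
def build_ridg(foreign_keys):
    """Return (V, E) for tuples (r1, x, r2, y, i): x is a primary key
    attribute of r1, y the matching foreign key attribute of r2, and i
    numbers the foreign key."""
    vertices = set()
    edges = set()
    for r1, _x, r2, _y, i in foreign_keys:
        vertices.update((r1, r2))
        # One edge per value of i, regardless of how many attributes the key has.
        edges.add((r2, r1, i))
    return vertices, edges


foreign_keys = [
    ("Course", "Code", "Prerequisite", "CourseCode", 1),
    ("Course", "Code", "Prerequisite", "PrereqCode", 2),
    # Hypothetical two-attribute foreign key: two tuples, one edge.
    ("Offering", "CourseCode", "Enrollment", "CourseCode", 1),
    ("Offering", "Term", "Enrollment", "Term", 1),
]
V, E = build_ridg(foreign_keys)
```

The two Prerequisite tuples carry different values of i and therefore yield two links to Course, whereas the composite key yields a single link.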

Figure 1 shows the RIDG derived from the information present in the 
ForeignKeys table (Table 2). Notice that two links are directed from Prerequisite to 
Course because the primary key of the Course relation has two corresponding 
foreign keys in the Prerequisite relation. The RIDG shown in 
Figure 1 contains almost the same information as the corresponding ERD. 
What remains is to find the cardinality of each link in RIDG; this is the subject of 
the next section. 

Fig. 1. The RIDG Graph of the relational schema in Example 1 



4.1 Deciding on the Cardinalities of Links 

It is necessary to decide on the cardinalities of the links in RIDG in order to be 
able to identify is-a links. In general, only some links with one-to-one cardinality 
are identified as is-a links. 

Links in RIDG are classified according to their cardinalities into two groups, 
one-to-one and many-to-one. This classification is based on the definition of 
RIDG, where a link is directed from R2 to R1 to reflect the presence of the 
primary key of R1 as a foreign key in R2. By definition, a primary key must 
have a unique value for each tuple in R1, and a set of tuples from R2 may have 
the same value for the corresponding foreign key. Based on this, the relationship 
between R1 and R2 becomes a mapping from R2 to R1 after neglecting tuples in 
R2 that have the value nil for the foreign key which corresponds to the primary 
key of R1. In other words, it is impossible to have two tuples from R1 related 
to the same tuple in R2 by considering one foreign key. The cardinality of a 
link or relationship is one-to-one if and only if at most one tuple from R2 holds 
the value of the primary key of a tuple from R1. Otherwise, the cardinality is 
classified as many-to-one. The cardinalities of all links in a given RIDG can be 
determined by running Algorithm 3, given next. 

Algorithm 3 (Find Cardinalities of Links in RIDG) 

Input: A relational database and the corresponding RIDG. 

Output: The cardinalities of all links in RIDG. 

Steps: 

For every edge (R2, R1) in RIDG do 
    Let pk be the primary key of R1 and fk be a corresponding foreign 
    key in R2; 
    Let R = Select * from R2 where the value of fk is not nil; 
    /* To Project R on fk with duplicates eliminated. */ 
    Let P = Select distinct fk from R; 
    If size(P) < size(R) then 
        The cardinality of link (R2, R1) is many-to-one; 
    ElseIf size(P) = size(R) then 
        The cardinality of link (R2, R1) is one-to-one; 
        If size(R2) = size(R) then 
            Link (R2, R1) may be is-a link; 
        EndIf 
    EndIf 
EndFor 

EndAlgorithm. □ 
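
The body of the loop in Algorithm 3 may be sketched in Python for a single edge as follows. This is an illustrative reading, not the authors' implementation: tuples of R2 are assumed to be dictionaries, None stands for nil, and a composite foreign key is assumed to be nil as soon as one of its attributes is nil. The relation and attribute names are hypothetical.

```python
def link_cardinality(r2_rows, fk):
    """Classify the link (R2, R1) from the tuples of R2 and the foreign key fk
    (a tuple of attribute names). Returns (cardinality, may_be_isa)."""
    # R: tuples of R2 whose foreign key is not nil.
    r = [row for row in r2_rows if all(row[a] is not None for a in fk)]
    # P: projection of R on fk with duplicates eliminated.
    p = {tuple(row[a] for a in fk) for row in r}
    if len(p) < len(r):
        return "many-to-one", False
    # size(P) = size(R): one-to-one; an is-a candidate if no tuple was dropped.
    return "one-to-one", len(r) == len(r2_rows)


students = [{"PersonId": 1}, {"PersonId": 2}]
staff = [{"Dept": "CS"}, {"Dept": "CS"}, {"Dept": "EE"}]
offices = [{"Holder": 7}, {"Holder": None}]

student_link = link_cardinality(students, ("PersonId",))
staff_link = link_cardinality(staff, ("Dept",))
office_link = link_cardinality(offices, ("Holder",))
```

The third case is one-to-one but not an is-a candidate, because one tuple of R2 has a nil foreign key.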

The cardinalities of all the links in the RIDG graph shown in Figure 1 can 
be determined by executing Algorithm 3 with the relational database in Ex- 
ample 1 as input. This terminates the documentation process because all the 
characteristics of the conceptual model that corresponds to the given relational 
database have been derived. However, an extra step can be executed to decide 
on the is-a hierarchy leading to a more natural data model; this is handled in 
the next section. 

4.2 Classifying Links 

In this section, we decide on is-a links. For that purpose, we define a table 
that includes all links which are candidates to be in the is-a hierarchy. What we 
call the Generalization table includes pairs of relation names such that the first 
relation is a candidate to be a sub-relation of the second. Tuples are included 
in Generalization by considering only one-to-one links which were identified by 
Algorithm 3 as candidates to be is-a links. An alternative to Algorithm 3 is 
to classify links based on the contents of the two tables CandidateKeys and 
ForeignKeys. A link is considered a candidate to be an is-a link if and only if the 
foreign key in the link is also a candidate key of its relation. According to this 
classification, Table 3(a) includes the initial Generalization table related to the 
relational schema in Example 1. 
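
The alternative classification based on CandidateKeys and ForeignKeys can be sketched as below. The table layouts and the attribute names are assumptions made for illustration; only the relation names are taken from the running example.

```python
def isa_candidates(foreign_keys, candidate_keys):
    """foreign_keys: tuples (r1, x, r2, y, i) as in the ForeignKeys table.
    candidate_keys: dict mapping a relation to a set of frozensets of attributes.
    Returns sorted (sub_relation, super_relation) pairs."""
    fk_attributes = {}
    for r1, _x, r2, y, i in foreign_keys:
        # Collect the attributes of each numbered foreign key of r2.
        fk_attributes.setdefault((r2, r1, i), set()).add(y)
    pairs = {
        (r2, r1)
        for (r2, r1, _i), attrs in fk_attributes.items()
        if frozenset(attrs) in candidate_keys.get(r2, set())
    }
    return sorted(pairs)


foreign_keys = [
    ("Person", "Id", "Student", "PersonId", 1),
    ("Course", "Code", "Prerequisite", "CourseCode", 1),
    ("Course", "Code", "Prerequisite", "PrereqCode", 2),
]
candidate_keys = {
    "Student": {frozenset({"PersonId"})},
    "Prerequisite": {frozenset({"CourseCode", "PrereqCode"})},
}
generalization = isa_candidates(foreign_keys, candidate_keys)
```

Neither foreign key of Prerequisite is a candidate key on its own, so only the pair (Student, Person) is proposed.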

We rely on an expert to suggest which tuples in Generalization correspond 
to actual is-a links. The current implementation relies on the user as the expert 
or the agent who suggests links to be in the is-a hierarchy. It is planned to 
have an autonomous agent to help in this decision-making process. An agent 
is given the opportunity to ask for further explanatory information that may 
help in making the right decision. In the current implementation, the available 
explanatory information includes tuples from both relations as well as related 
primary and foreign keys of both relations. Providing helpful information to the 
agent is necessary because an agent should not depend on the names of relations to 
make a suggestion; on the contrary, names of relations may be misleading. 
Concerning the example relational database and the related possible is-a links 

(a) 
Sub-Relation         Super-Relation 
Student              Person 
Staff                Person 
Secretary            Person 
Department           Staff 
ResearchAssistant    Student 
ResearchAssistant    Staff 

(b) 
Sub-Relation         Super-Relation 
Student              Person 
Staff                Person 
Secretary            Person 
ResearchAssistant    Student 
ResearchAssistant    Staff 

Table 3. Two versions of the Generalization table: 
a) A list of all possible sub-relations and super-relations 
b) Optimized list of sub-relations and corresponding super-relations 



present in Table 3(a), it is suggested that the link (Department, Staff) is not 
part of the is-a hierarchy. The corresponding optimized Generalization table is 
Table 3(b). This way, only the actual is-a links are left in Generalization. 



5 Conclusions and Future Work 

In this paper we presented a novel approach to documenting a conventional 
relational database. We developed algorithms that build the necessary 
knowledge based on a successful understanding of the structure and characteristics of 
an existing conventional relational database. The implementation thus provides 
two options: either the user feeds the system with the required 
understanding, if it is available, or leaves the system to investigate and mine 
the characteristics of the conventional relational database with little help 
from the user. In the current implementation, the user's help is expected in 
decisions that require some semantic reasoning which cannot be inferred from the 
conventional database contents. Our plan is to minimize the dependency on users 
by developing and providing autonomous agents that can help in this 
decision-making process. 

Finally, we realized that mining the characteristics of a conventional 
database has some other advantages. In addition to documenting an existing 
relational database, it also provides the possibility of moving to an object-oriented 
database. Currently, we are concentrating on forward engineering to an 
object-oriented database management system, benefiting from our previous work and 
experience with object-oriented databases [2-5]. Our aim is to develop 
algorithms for successful “bulk loading” of data from a relational database into an 
object-oriented database. We are also investigating the equivalence of two existing 
database schemas, one relational and the other object-oriented. We are 
motivated by having two generations of information systems to serve the same 
application. 

Abstract. This paper describes a method aiming at the extraction of 
generalization/specialization hierarchies contained in a relational database. This 
reverse engineering approach takes advantage of two major characteristics: 
first, we use DDL and DML specifications as well as data in a combined way; 
secondly, we provide not only generalization/specialization hierarchies but also 
integrity constraints allowing us to elicit the generalization/specialization links 
hidden in the structures and instances of the database. The result of the process 
consists of an enriched conceptual representation of the relational database. This 
approach is mainly based on heuristics. The heuristic rules map a relational 
meta-model onto a conceptual one. They are divided into three categories: 
semantics suspicion rules, reinforcement rules and confirmation rules. We 
illustrate our approach using a fairly complex example. Some extensions are 
discussed. 



1 Introduction 

We have seen in recent years an unprecedented rate of development of reverse 
engineering approaches. The main forces behind this popularity are increasing costs 
of old software maintenance. Maintenance costs are high mainly due to the changing 
environment of the firms and faulty systems analysis, especially information 
requirement analysis [1]. Reverse engineering aims at lowering the effort that goes 
into modifying and enhancing existing software systems. In these software systems, 
databases are the most stable part. Recovering the conceptual schema of these 
databases is the main objective of reverse engineering approaches. 

As stated in the beginning of this section, the purpose of this paper is to describe 
the development and the application of an approach for the conceptualization phase of 
relational database reverse engineering. This methodology allows the designer to 
extract the conceptual schema of the relational database using the schema as described 
in the Data Definition Language (DDL) and the queries and/or programs expressed with 
the Data Manipulation Language (DML). Examination of the DDL and DML specifications 
is followed by a data exploration process. Reverse engineering rules are defined on 
two different meta-models. The first meta-model represents the conceptual schema 
elaborated during the reverse engineering phase. This meta-model is called the target 
schema. The second meta-model represents the relational base including DDL 
specifications and summary data resulting from DML and data exploration. This 
meta-model is called the source schema. 

More specifically, this paper describes the reverse engineering of generalization 
hierarchies. The latter, which are not explicitly represented in the relational model, are 
an important concept of the extended ER model, allowing the designer to achieve a 
more precise description of reality. However, the generalization hierarchies can be 
expressed using other means such as null values in the tuples, inclusion constraints 
and view definitions in the DDL specifications, and queries in the DML 
specifications. Extracting the generalization hierarchies brings three main advantages. 
First, it permits us to discover hidden entities. Secondly, it is very useful in the context 
of software migration, especially toward object-oriented databases. Thirdly, it 
facilitates schema integration of heterogeneous databases. 

The rest of the paper is organized as follows. Section 2 is dedicated to the state of 
the art of methodologies and approaches related to database reverse engineering, with 
a particular emphasis on those which deal with extraction of generalization 
hierarchies. Section 3 describes our approach called MeRCI (a French acronym for 
Intelligent Reverse Engineering Method). Section 4 is dedicated to reverse 
engineering of generalization hierarchies. The four main representations of 
generalization hierarchies using the relational model are synthesized. Their detection 
is detailed and illustrated in Section 5. An example including the four cases is 
analyzed. Section 6 concludes and describes further research. 

2 State of the Art 

Reverse engineering refers to the process of discovering how a software system 
works. Its aim is to recover system components and their relationships. It 
encompasses a wide range of tasks related to the creation of a high-level abstraction 
of software systems. Database reverse engineering consists of extracting conceptual 
schemas from databases (hierarchical, network, relational or object-oriented). It 
consists of schema transformation. 

Prior research in reverse engineering can be split into three main generations. The 
first generation proposed reverse engineering methods for conventional files, mainly 
Cobol-type files [2]. The second generation focused on transformation of navigational 
database schemas (hierarchical and Codasyl-type) [3]. 

The third generation of research in database reverse engineering, devoted mainly 
to relational and object databases, can be split into three main streams. The first 
formalizes the reverse engineering process using algorithms [4,5]. The second 
performs a heuristic reverse process [6,7,8,9]. The third stream concerns the 
transformation of relational databases into object-oriented ones [10,11,12]. 

Depending on the approach, different information sources (Data Definition 
Language, Data Manipulation Language, Data, User) are used during the reverse 
engineering process. To the best of our knowledge, only three approaches used all the 
sources available to perform a comprehensive reverse engineering process [6,8,9]. 
The conceptualization framework presented in this paper uses all the sources in a more 
selective and intensive way. By applying a set of reverse engineering rules 
during the mapping of the relational meta-model to a conceptual meta-model, we 
are able to extract the conceptual EER schema. In the next section, we briefly describe 
the main principles of this method. 

Some of these reverse engineering approaches deal with the elicitation of 
generalization/specialization hierarchies [11,12,13]. These identification techniques 
allow the reverse engineer to exhibit generalization/specialization links between 
relations as well as between entities represented by a common relation. The 
identification is based on the DDL description and/or on data. However, to the best of 
our knowledge, they are limited to one level of hierarchy. Moreover, they do not use 
the DML specifications, which also contain semantic information, such as view 
definitions or cursor descriptions. 

We propose to combine the three following sources : DDL specifications, DML 
specifications and data and to use them in a selective way. Using reverse engineering 
rules, we produce an EER conceptual schema including generalization hierarchies. 
The following section describes our general reverse engineering framework. Section 4 
will focus on the detection of inheritance links. 

3 The MeRCI Approach 

MeRCI, a French acronym for Intelligent Reverse Engineering Method (Méthode 
de RétroConception Intelligente), aims at the transformation of a relational database's 
physical and logical schemas into a conceptual schema, starting mainly from the 
source code of the application (DDL and DML) and, if necessary, by exploring the 
data. The study of the physical schema and its associated SQL programs allows us to 
identify relevant information translated into a knowledge fact base using the reverse 
engineering rule base. This intelligent translation process leads to a conceptual 
schema. Based on an expert system, it proposes alternative solutions to be validated at 
the end of each reverse engineering phase by the database designer. Interactions with 
the designer take place at the end of each step. 

The main steps of MeRCI are: 

a) Extraction of the physical schema 

The aim of this step is to obtain a complete and detailed description of the physical 
schema. 

b) De-optimization of the physical schema 

Starting from the physical schema obtained at the end of the preceding step, we apply 
a set of physical reverse engineering rules to de-optimize the physical schema [14]. 

c) Conceptualization of the logical schema 

This step is detailed in [15]. The entity, relationship, cardinality, and generalization 
identification process is performed: 

- using all the information extracted from the de-optimized physical schema, 

- examining the embedded SQL source code to determine identifiers (primary keys, 
unique columns, unique index and possible candidate keys), 

- searching for synonyms, homonyms, foreign keys, 

- looking for referential integrity constraints, 

- identifying multi-valued attributes and decode tables, 

- detecting generalization / specialization links. 

d) Semantic description 

Starting from the conceptual schema obtained by applying the step described above 
and using some hypotheses formulated on the conceptual meaning of the entities, 
relationships and generalizations obtained, we characterize the data objects of the 
universe of discourse. 

Formalizing reverse engineering rules requires a language allowing us to describe: 

- the conceptual schema to be extracted (target schema), 

- the relational database to be reversed (original schema), including the DDL 
specifications, the requests and the DML specifications, and finally the extension of 
the database (the data), 

- the mapping of the relational database to the conceptual schema. 

To achieve such a result, we have developed : 

- a meta-model of the target conceptual schema allowing us to represent the concepts 
which have to be discovered during the conceptualization phase, 

- a meta-model of the relational database which includes: 

• the detailed schema specifications extracted from the DDL, 

• the information extracted from the DML specifications and presented in an 
aggregate form, 

• the results of the set operations performed on the database tuples, 

- a set of production rules linking together the concepts of the two meta-models 
associated to certainty factors. 

These components of the reverse engineering conceptualization framework are 
described in some detail below, including the rules performing the mapping of the 
relational meta-model to the conceptual meta-model. The two meta-models are 
expressed using EER concepts. 

3.1 The Target Conceptual Schema Meta-Model 




Fig. 1 - The target conceptual schema meta-model 



Converting any relational database schema into an EER schema requires a generic 
description of the latter. In other words, the transformation of any relational schema 
can be performed only if we have a model of the EER model. Therefore, a conceptual 
meta-model is needed. An EER schema meta-model is presented in Fig. 1. The main 
entities of this meta-model are the entity and the relationship linked to an object, 
representing the generic concepts of the model. A generic object is linked to one or 
several attributes and to one or several identifiers. Attribute and identifier may be 
either atomic or composed of atomic attributes. Attribute and identifier are modeled 
as weak entities since their existence and their identification depend on the object (i.e. 
entity or relationship) to which they belong. The entity and the relationship are 
connected by "links" characterized by the cardinality and by the role played by the 
participating entities. Generalization hierarchies between entities are represented by 
the inheritance relationship. 

In order to identify all the concepts described in the conceptual schema meta-model, 
reverse engineering rules will be applied to the relational database meta-model 
described below. 

3.2 The Relational Meta-Model 

The relational meta-model is used to store information generated during the 
exploration of the relational database. It contains semantic information extracted from 
one of the sources used : the DDL specifications of the database, the DML 
specifications included in programs, or the data. Fig. 2 contains the meta-model 
generated during the reverse engineering process. The relational tables contain 
columns described by a name, an identifier-similarity which is valued as true if the 
name of the column looks like an identifier (containing strings code, id, #, num, etc.), 
the domain or data type, mandatory if the column is defined as not null. 




Groups of columns may be defined as : 

- index, unique or not, 

- foreign keys if a clause REFERENCES or FOREIGN KEY is found, 

- identifiers, if a clause UNIQUE or PRIMARY KEY is met. 

Moreover, the meta-model contains the description of existence constraints, which are 
very useful to identify generalization hierarchies. We take into account non-null, 
mutual and exclusive existence, conditioned existence, intersection and inclusion 
constraints [16]. A relational table may represent several entities. In such a case, null 
values are used and constraints related to these null values may be defined. They are 
existence constraints defined between tuples. We consider four kinds of constraints: 
non-null, mutual existence, exclusive existence or conditioned existence. A non-null 
constraint is established on a subset of columns in a relation which must always be 
valued. They are mandatory attributes. A mutual existence constraint between two columns x 
and y of a relational table, noted x ↔ y, is defined when, for each tuple, x and y are 
both valued or both null. An exclusive existence constraint, noted x ↮ y, is defined between x 
and y if, when x is not null, y is null and vice-versa. A conditioned existence 
constraint, noted x → y, is defined between x and y when, if x is not null, then y is not 
null. We read this as: x requires y. 
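
The three existence constraints between two columns can be checked against the tuples of a table as in the sketch below. Such a check can only suggest a constraint, since data alone cannot prove it. The function names and the sample rows are hypothetical, None stands for null, and the exclusive constraint is read here as "x and y are never both valued", which is one possible reading of the definition above.

```python
def mutual_existence(rows, x, y):
    """x <-> y: in every tuple, x and y are both valued or both null."""
    return all((r[x] is None) == (r[y] is None) for r in rows)


def exclusive_existence(rows, x, y):
    """x and y are never both valued in the same tuple (assumed reading)."""
    return all(r[x] is None or r[y] is None for r in rows)


def conditioned_existence(rows, x, y):
    """x -> y (x requires y): whenever x is valued, y is valued."""
    return all(r[x] is None or r[y] is not None for r in rows)


# Hypothetical table mixing employees and students.
rows = [
    {"salary": 3000, "rank": 2, "level": None},
    {"salary": 2500, "rank": None, "level": None},
    {"salary": None, "rank": None, "level": "graduate"},
]
```

On these rows, rank requires salary, salary and level are exclusive, and salary and rank are not in mutual existence because the second tuple has a salary without a rank.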

By analyzing DML specifications, in cursors, programs or procedures, we are able 
to feed other parts of the meta-model (Fig. 2). Queries are stored in the entity Query. 
A name is attributed (program name and line number, by default). The queries are 
also characterized by their type (join, union, join + union, select, etc.). Views are 
considered as queries (with view=yes). DML analysis allows us to detect if tables can 
be decomposed into hierarchies of entities. Groups of columns may be: 

- access keys if there exist selections based on values of these columns, 

- potential keys, i.e. access keys which bring at most one tuple (for example, in the 
programs, the selection is never followed by a loop on a set of tuples). 

Query analysis, mainly of join conditions, allows us to suspect homonyms or synonyms 
between column names. 

Finally, information deduced from table exploration is also represented in Fig. 2. 
First, we detect null values. Secondly, for groups of columns identified previously, 
common values, common value sets, inclusion of value sets and possible existence 
constraints (mutual, exclusive, conditioned) are detected. Of course existence 
constraints cannot be deduced from data but only suspected. 

In order to describe the reverse engineering of generalization hierarchies, we 
describe in the following section, the different ways to express these hierarchies in 
relational bases. 

4 Reverse Engineering of Generalization Hierarchies 

The relational model does not express generalization hierarchies explicitly. 
However, the existence of generalization semantics can often be deduced using the 
different sources of information about the database. In this section, we describe the 
different ways to translate generalization/specialization links in the relational model 
and give some clues to detect them. 



4.1 Relational Translation of Generalization/Specialization Links 

The different ways to express generalization links can be summarized in four 
elementary cases. Their combination can lead to multiple situations: 

- Case 1, where specific entities are ignored: the specific attributes are assigned to 
the generic entity. 

- Case 2, where the generic entity is ignored: the generic characteristics are 
transferred to all specifics. 

- Case 3, where both specific and generic entities are translated and the key is 
duplicated. The same key is used to establish the link between the two tables. 
A similar case could occur where an identifier would be the primary key in 
one table and a candidate key in the other table. 

- Case 4, where both specific and generic entities are translated and a mapping table is 
used. 
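
To make the four cases concrete, the sketch below creates one toy schema per case in an in-memory SQLite database. The tables person, employee, student and person_student and all their columns are hypothetical and are not part of the paper's example; the DDL only illustrates the shape of each translation.

```python
import sqlite3

# One list of DDL statements per case (hypothetical generic entity "person"
# with specific entities "employee" and "student").
CASES = {
    1: ["CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
        "kind TEXT, salary REAL, level TEXT)"],
    2: ["CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL)",
        "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL, level TEXT)"],
    3: ["CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
        "CREATE TABLE student (id INTEGER PRIMARY KEY REFERENCES person(id), level TEXT)"],
    4: ["CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
        "CREATE TABLE student (stud_id INTEGER PRIMARY KEY, level TEXT)",
        "CREATE TABLE person_student (id INTEGER REFERENCES person(id), "
        "stud_id INTEGER REFERENCES student(stud_id))"],
}


def tables_for_case(case):
    """Create the schema of one case and return its table names, sorted."""
    conn = sqlite3.connect(":memory:")
    for ddl in CASES[case]:
        conn.execute(ddl)
    query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    names = [row[0] for row in conn.execute(query)]
    conn.close()
    return names
```

In case 1 the nullable columns salary and level carry the specific attributes; in case 4 the two-column table person_student plays the role of the mapping table.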

In the following paragraph, we describe how these four cases can be detected by the 
analysis of information available in the database. 

4.2 Detection of Generalization Links 

The detection of generalization links hidden in the relational database is performed 
using the three information sources; the DDL specifications, the DML specifications, 
and the data. 

a) Case 1: only the generic table exists 

DDL analysis: 

In such a case, there must exist a few columns not declared as NOT NULL in the 
DDL specifications. Specific attributes are contained in the generic table, leading to 
null values for the other tuples. The specifications may also contain an exclusive 
existence constraint between specific attributes. 

DML analysis: 

If specific entities are of interest, programs contain queries selecting only those 
specifics inside the generic table. We call these queries alternative ones. 

Data: 

In this case, the specialization is represented by an attribute. This attribute can be 
detected because it has only a few possible values (two or three). Moreover, groups of 
columns have null values. Finally, tuple analysis allows the designer to suspect 
existence constraints between attributes. 

b) Case 2: only the specific tables exist 

DDL analysis: 

Specific tables have key columns with the same names and the same data types. Other 
columns also may be common. However, here, key columns are different. 

DML analysis: 

In this case, some queries may contain a union operation on common columns or 
selection of these common columns. 

Data: 

If specific entities are overlapping, keys and inherited columns have common values. 

c) Case 3: both generic and specific tables exist 

DDL analysis: 

In this case, as in case 2, the specific tables and the generic tables generally have key 
columns with the same names and the same data types. 

DML analysis: 

Programs must contain both union queries on specific tables and join queries between 
specifics and generic. 

Data: 

The generalization link may be identified by the detection of common values between 
key data sets. Their key data sets have a lot of common values. 

d) Case 4: both generic and specific tables and a mapping table exist 

DDL analysis: 

The mapping table contains only two columns which are respectively keys of the 
generic and the specific tables. Key columns must have the same data types and may 
have the same names. Reference constraints may be defined between key columns. 

DML analysis: 

Generic, specific and mapping tables are used in join queries to link the information 
contained in these three tables. 

Data: 

The value sets of both columns in the mapping table are included in the value sets of 
the other tables. 
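
The data-level checks for cases 3 and 4 reduce to set operations on key values, as the following sketch suggests. The function names and the sample values are hypothetical, and the source does not state how many common values count as "a lot", so the first function only reports a ratio.

```python
def common_key_ratio(generic_keys, specific_keys):
    """Case 3: share of the specific table's key values that also occur
    as key values of the generic table."""
    specific = set(specific_keys)
    if not specific:
        return 0.0
    return len(specific & set(generic_keys)) / len(specific)


def mapping_values_included(mapping_rows, generic_keys, specific_keys):
    """Case 4: both columns of the mapping table take their values from the
    key value sets of the generic and the specific table, respectively."""
    generic_side = {g for g, _s in mapping_rows}
    specific_side = {s for _g, s in mapping_rows}
    return generic_side <= set(generic_keys) and specific_side <= set(specific_keys)


ratio = common_key_ratio([1, 2, 3, 4], [2, 3])
included = mapping_values_included([(1, "s1"), (2, "s2")], [1, 2, 3], ["s1", "s2"])
dangling = mapping_values_included([(5, "s1")], [1, 2, 3], ["s1", "s2"])
```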

These heuristics, used to detect generalization links in the relational tables, are 
illustrated through an example in the following section. 

5 Elicitation of Generalization Hierarchies : the MeRCI Approach 

We describe the example and then we illustrate the reverse engineering rules on it. 

5.1 Example 

The example contains a relational database called University, characterized by the 
following relational schema: 

PERSON (access-code, lastname, firstname, birthdate, address) 
FACULTY (faculty-id, access-code, recruitdate, salary, rank, step, group, department, 
researchteam-id) 
STAFF (staff-code, access-code, recruitdate, salary, rank, step, group, servicenumber) 
STUDENT (stud-id, access-code, level, academic-cycle, researchteam-id) 
STUDENT-HIST (stud-id, year, result, academic-cycle, level) 
RESEARCH-TEAM (researchteam-id, teamname, labo-code, team-manager, budget) 
LABORATORY (labo-code, labo-manager, yearbudget) 
SERVICE (servicenumber, serv-manager, supervising-service) 
DEPARTMENT (department-code, departmanager) 
STRUCTURE (structure-code, structure-name, address) 
S S (structure-code, servicenumber) 
S L (structure-code, labo-code) 
S D (structure-code, department-code) 

The queries available are summarized below: 

Q1: Select employees recruited more than 20 years ago 
Q2: Print all pay slips 
Q3: Modify the rank of a faculty member 
Q4: Select students of a given academic cycle 
Q5: Select students who obtained an excellent result at the graduate level 
Q6: List permanent faculty related to a given department 
Q7: List the hierarchical organization of services 
Q8: List temporary faculty 
Q9: List assistant professors linked to a given department 
Q10: List full professors whose rank has been 1 for more than 10 years 
Q11: List all persons allowed to access the university 







5.2 Reverse Engineering Rules 



To avoid a tedious description of the reverse engineering rules, we have 
summarized the main families of rules in the decision table of Figure 3. 




Fig. 3 - Decision table for generalization/specialization hierarchies 



The first six rules (R1 to R6) detect inheritance links through DDL specifications. 
For example, if two tables have common keys (same names), a generalization link 
(case 2, 3 or 4) may be suspected (rule R1). If a table contains a candidate key column 
which is also the key of another table, cases 3 or 4 must be supposed (rule R2). If a table 
contains non-mandatory columns, it may perhaps be split into several entities (rule 
R3). If several tables have keys with the same names and/or the same domains, cases 
2 or 3 have to be detected (rule R4). If a table contains only two foreign keys, it may 
be a mapping table (rule R5). Finally, if a REFERENCES constraint is specified 
between such tables, cases 3 and 4 are suspected (rule R6). 

Rule R1 may be paraphrased in the following way: 

IF table T1 has a column C with a certainty equal to 1 
And table T2 has a column C with a certainty equal to 1 
And T1 is an entity with a certainty equal to 1 
And T2 is an entity with a certainty equal to 1 
And C is identifying T1 with a certainty equal to 1 
And C is identifying T2 with a certainty equal to 1 
THEN there exists a generalization link between T1 and T2 with a certainty equal to 
pi. 

Certainty factors associated with the conditions and conclusions of the rules 
measure the level of confirmation of the information: for example, certainty pi 
means that the information is deduced from a single source (either DDL or DML or 
data) [15]. 
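
As a rough illustration, the paraphrase of rule R1 can be sketched in Python. The `Table` structure, the example column names and the symbolic certainty label `"pi"` are assumptions made for this sketch only; they are not taken from the implementation of the method.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Table:
    """Hypothetical summary of a table as read from the DDL."""
    name: str
    key_columns: frozenset  # names of the identifying columns


def rule_r1(tables):
    """Suspect a generalization link between tables with homonymous keys.

    Returns (table, table, certainty) triples. The certainty is kept as
    the symbolic label "pi" because only one source (DDL) supports it.
    """
    links = []
    for t1, t2 in combinations(tables, 2):
        if t1.key_columns and t1.key_columns == t2.key_columns:
            links.append((t1.name, t2.name, "pi"))
    return links


# Column names below are invented for the example.
tables = [
    Table("PERSON", frozenset({"person_id"})),
    Table("FACULTY", frozenset({"person_id"})),
    Table("STRUCTURE", frozenset({"structure_id"})),
]
suspected = rule_r1(tables)
```

Here only PERSON and FACULTY share an identifying column name, so a single link is suspected; confirming or rejecting it is left to the other rules and to the designer.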

Rules R7 to R9 are also based on the exploration of DDL specifications. More precisely, 
they rely on the definition of existence constraints. If two columns are mutually 
exclusive, there may exist an inheritance link (case 2 or 3 - rule R8). If they are 
linked by a mutual existence constraint and if a generalization hierarchy is confirmed, 
these two columns are to be inserted in the same entity of the generalization hierarchy 
(rule R9). Finally, if column C1 requires column C2 (conditioned existence 
constraint) and if a generalization hierarchy is confirmed, C2 must be associated with 
the same entity as C1 or with a generic of it (rule R7). 

Rules R10 to R13 are based on the exploration of DML specifications. If a table is 
queried by several programs selecting either a group of columns or another group of 
columns (we call them alternative queries), an inheritance link is suspected (case 1 - 
rule R10). If union queries between several tables are found, an inheritance link of 
case 2 is suspected (rule R11). If there also exist join queries of tables Ti with a same 
table T, case 3 must be suspected (rule R12). If there is neither an alternative query nor 
a union query but only join queries with a table composed only of two foreign keys, a 
generalization of case 4 must be suspected (rule R13). 

Rules R14 to R17 perform data exploration to confirm one of the generalization 
cases already suspected by the previous steps. For example, in order to confirm a 
generalization of case 1, one must verify that non-mandatory columns effectively contain 
null values (rule R16). For a generalization of case 3 or 4, we must check the existence 
of identical values between the key columns of the tables and the inclusion of these 
key values (rules R14 and R15). Finally, the existence of common values (intersection 
constraint) will allow us to confirm a generalization of case 2 or 3 (rule R17). 

Finally, rules R18 to R20 perform data analysis in order to suspect existence 
constraints when the latter are not described by the DDL specifications. Suspecting a 
mutual existence constraint between two columns C1 and C2 requires verifying that 
there is no tuple for which C1 is null while C2 is not null and vice versa (rule R20). 
Two columns (or groups of columns) can be supposed to be exclusive if there is no 
tuple for which these two columns are both valued (rule R19). Finally, if a table T contains 
two columns C1 and C2 such that there is no tuple for which C1 has a value and C2 is 
null, a conditioned existence constraint (C1 requires C2) can be suspected (rule R18). 
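
The three data-analysis checks of rules R18 to R20 can be sketched as simple predicates over the tuples of a table. This is a minimal sketch under stated assumptions: tuples are represented as Python dictionaries, null is represented by `None`, and the column names in the example are invented.

```python
def mutual_existence(rows, c1, c2):
    """R20: no tuple has one of c1, c2 null while the other is valued."""
    return all((r[c1] is None) == (r[c2] is None) for r in rows)


def exclusive(rows, c1, c2):
    """R19: no tuple has both c1 and c2 valued."""
    return not any(r[c1] is not None and r[c2] is not None for r in rows)


def requires(rows, c1, c2):
    """R18: no tuple has c1 valued while c2 is null (c1 requires c2)."""
    return not any(r[c1] is not None and r[c2] is None for r in rows)


# Hypothetical extension of a FACULTY-like table.
rows = [
    {"rank": "full", "tenure_date": "1990-01-01", "contract_end": None},
    {"rank": None, "tenure_date": None, "contract_end": "2000-06-30"},
]

r20 = mutual_existence(rows, "rank", "tenure_date")
r19 = exclusive(rows, "tenure_date", "contract_end")
r18 = requires(rows, "rank", "tenure_date")
```

Each predicate only shows that the data do not contradict the constraint. On an empty or very small table all three hold trivially, which is one reason the result is a suspicion and not a confirmation.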

5.3 Application of the Rules to the Example 

Rule R1 (homonymous keys) can be fired on the tables STRUCTURE, S_S, S_L and 
S_D to deduce an inheritance link (case 2, 3 or 4) between these tables. In the same 
way, this rule can be applied to tables PERSON, FACULTY, STAFF and 
STUDENT. Rule R2, applied to tables PERSON, FACULTY, STAFF and 
STUDENT, reinforces the suspicion of inheritance links (case 2 or 3) between these 
tables. This contradicts case 4, which is nevertheless suspected for the tables 
STRUCTURE, S_S, S_L and S_D. Rule R4 will confirm a generalization link (case 2 
or 3) for tables FACULTY and STAFF on the one hand and for tables STUDENT and 
STUDENT-HIST on the other hand. Rule R5, fired on tables S_S, S_L and S_D, 
reinforces the conclusion of rule R1, leading to a probable inheritance link (case 4) for 
which S_S, S_L and S_D are mapping tables. If no existence constraint is specified, 
rules R7 to R9 do not apply. Queries Q6, Q8, Q9 and Q10 will trigger rule R10. They 
show that several queries perform alternative selections on several columns of table 
FACULTY, which leads to the specialization of this table into several sub-classes. 
Queries Q1 and Q2 are union queries on tables FACULTY and STAFF. Query Q7 
requires join operations between tables S_S, S_L, S_D and the table STRUCTURE. 
This query will trigger rule R13, confirming an inheritance link (case 3 or 4) between 
these four tables. Queries Q11, Q1 and Q2 allow the system to trigger rule R12 and 
confirm an inheritance link (case 3) between the tables PERSON, FACULTY, STAFF 
and STUDENT. 



[Figure: hierarchy diagram. Nodes shown: Person; Student, Student-hist; 
Staff+Faculty; Staff, Faculty; Faculty1, Faculty2, ...; Structure with S_S, S_L, S_D.] 

Fig 4 - Application of the rules 



At the end of this process, the generalization hierarchies obtained are shown in 
Figure 4. The links are not characterized by the same level of certainty, depending on 
the number of rules leading to these links. In this example, the suspicions seem to 
be quite correct. However, the link between Student and Student-hist (certainty pi) is 
not correct. It results from the application of rule R4. The confirmation phase, 
conducted by the human designer, will probably reject this link because of its low 
certainty. This phase will also consist in naming the different new entities: 

- Staff+Faculty will be Employee, 

- Faculty1, Faculty2, ... will be given the names of the different ranks of faculty. 

Let us recall that this conceptualization step not only identifies the generalization 
links but also the other concepts. Thus, it results in a whole EER conceptual model. 



6. CONCLUSION AND FURTHER RESEARCH 

This paper describes a method aiming at the extraction of generalization 
hierarchies contained in relational databases. This method, part of the MeRCI 
approach for reverse engineering of relational databases, has the following 
characteristics. First, it performs the extraction process using three combined sources 
of information, namely the DDL specifications, the DML queries which can be 
detected in program code, procedures and available cursors, and the data provided in the 
relational tables (database extension). Secondly, it is based on heuristics allowing the 
reverse engineer to detect the hidden concepts contained in the three sources of 
information. Thirdly, it generates the hidden generalization links existing between the 
entities of the conceptual model as well as existence constraints between the attributes 
of the entities. 

The main objective of the paper is to extract conceptual inheritance links, thus 
facilitating the migration of relational systems to object-oriented environments. Our 
approach takes advantage of prior contributions in the domain and enriches them by 
adding new detection modes for generalization hierarchies. The decision table allows 
us to structure and to exhibit the reverse engineering rules leading to the extraction of 
the generalization hierarchies. It also offers a way to reach an internal validation of 
the rule set. The latter is obtained by exploring real-life relational database tables, by 
analyzing previous approaches and by integrating rules provided through our 
experience. External validation is underway, in particular by exploring other relational 
tables. The implementation is at its final stage. It is performed using a software 
platform combining Smart Elements (an object-oriented expert system) and an Oracle 
relational database. 

Further research will include methods and techniques to extract semantics 
contained in queries. Our method for the elicitation of generalization hierarchies can be 
extended to different kinds of databases, namely hierarchical, network, 
object-oriented and multidimensional databases. 

Abstract. Software evolution is an inevitable process for software systems. 
Repeated changes alter the structure of a system, rapidly degrading it 
and making the system "legacy". Reengineering seems to be a 
promising approach to upgrade these systems according to the latest 
technologies. This paper describes a tool to reengineer procedural systems 
written in Cobol, Fortran, C or Pascal, into object-oriented ones 
written in Smalltalk. The prototype developed identifies potential classes 
automatically, but allows user intervention to resolve conflicts. 



1 Introduction 

A significant number of the systems in use today are many years old and 
have become reliable over the years. These systems are called legacy systems and may be 
defined as large software systems that people do not know how to cope with, but that 
are vital to organisations since they may be the only place where an organisation's 
business rules exist [3]. Usually, during the maintenance process, the structure 
and the documentation of systems deteriorate. Something has to be done to keep 
these systems up to date, and the decision on what to do is critical because they 
may represent years of accumulated experience and knowledge. 

The object-oriented paradigm is the predominant software trend of the 
1990s. According to the literature, it provides a unifying model for various phases 
of development, facilitates system integration, allows prototyping, encourages 
software reuse, eases system maintenance and provides support for extensibility 
[21]. An object-oriented system is best developed starting with object-oriented 
analysis. Nevertheless, this may sometimes be difficult because of the existence 
of many legacy systems. Hence, a need has arisen to reverse engineer and reengineer 
existing legacy systems in order to keep them up to date with the latest technologies. 
A lot of work has been done to improve legacy code quality because 
it has a great impact on the comprehension, maintenance and evolution of legacy 
systems. All these efforts may be referred to as software reengineering activities 
[1]. 

Among the proposals aiming at improving software quality and understanding, 
code restructuring, program modularization and migration from imperative 
programs to object-oriented ones can be mentioned. Restructuring is one of 
the oldest and most refined reengineering techniques [1]. One of its associated 
meanings is the modification of a program's control structure so that it follows 
the rules imposed by structured programming. Many algorithms have been defined 
to restructure programs by introducing new variables in them, but they always 
change program topology [5], [19]. In contrast, cliche-based methods can fail in 
unexpected situations [7], [16], [24]. 

Program modularization consists of decomposing a monolithic program or 
module and replacing it with a functionally equivalent collection of smaller 
modules. Modules should have high cohesion and low coupling. Several methods have 
been defined to elicit functions from programs and, according to the goals of each 
work, these functions are analysed as candidates for reuse or for rewriting the program 
in a modular way [6], [9], [18]. Many of these works employ program slicing, 
a program decomposition method well suited for isolating functionalities in a 
program [27]. 

The migration from imperative programs to object-oriented ones aims to 
construct a hierarchy of classes that perform the same computations as the 
original procedures. Each class encapsulates data and the methods for processing it. 
Several techniques have been proposed to identify object-like features in imperative 
programs [13], [17], [20]. The one defined in [20] introduces two methods, one based 
on global and persistent data, and the other based on the types of formal 
parameters and return values. Other approaches target programs written in 
a specific programming language, like Fortran [23], [26], Cobol [4], [25] or RPG 
[14]. All these works agree that transforming an imperative program into an 
object-oriented one is a difficult task, which cannot be completely automated. 

This paper describes the last step of a project whose aim is to develop a tool 
to transform legacy systems in order to simplify and improve their maintenance 
and understanding, taking advantage of object-oriented technology. 

As part of this research, a prototype has been developed which implements 
the algorithms to restructure, modularise and extract objects automatically. Hu- 
man intervention is allowed in order to improve the results. 



2 The Project 

This section presents a project that transforms legacy systems in order to 
enhance their architecture (Fig. 1). To achieve this, it is necessary to capture and 
recover all the knowledge extracted from procedural programs and store it in a 
higher level structure, which can be analysed and manipulated. From this 
structure, objects and classes are recognised and extracted to rewrite the program 
in an object-oriented language. Besides preserving the original functionality, the 
new code generated should be structured, legible, modular, reusable, and more 
easily maintainable. The only source of information is the procedural source 
code of the programs, and its quality has a great influence on the quality of the 
recovered objects. To minimise this influence, programs are first syntactically 
restructured and modularised. 

To represent all the information extracted from imperative programs, two 
structures have been defined as the Software Repository. One of them, called 
Intermediate Language (IL) [12], was specially designed to allow the tool to be 
used with programs written in different imperative languages. An advantage of 
this structure is that the algorithms designed to enhance code quality are 
independent of the original source code. It also allows rewriting the improved 
programs in a language different from the original one. The languages considered 
are Pascal, Cobol, C and Fortran. The other structure is the Extended Program 
Dependence Graph (EPDG) [12], which is an extension defined over the Program 
Dependence Graph [15]. It was designed to store and manipulate in an easy way 
the information represented in the IL. 




Fig. 1. Outline of the project 



This tool is structured into three main steps. The first one, called Syntactic 
Restructuring Step [11], turns an arbitrary imperative program into a structured 
one by means of transformations defined over the EPDG, according to the rules 
imposed by structured programming. It does not modify the portions of the 
program already structured. The second one, referred to as Modularization Step 
[10], decomposes a monolithic structured program into a functionally equivalent 
collection of smaller modules. A variant of output-restricted slice was defined 
to capture all relevant computations involving a given variable. Following this 
definition, a slice is constructed for each different use of each program variable. 
Starting from the lattice of the computed slices ordered by set inclusion, 
candidate modules are automatically detected and extracted. Each potential module 
is then analysed considering some criteria before its implementation. The 
modularization algorithm ensures that the obtained slices have high cohesion and low 
coupling, as they compute a single output. Since the concept of object-oriented 
programming increases maintainability and reusability of systems, the last step, 
called Object-Oriented Conversion Step, aims at turning a structured and 
modular imperative system into an object-oriented one. 

The first two steps of the tool are completely implemented in a prototype 
that takes as input an imperative program written in Pascal, Cobol, C or 
Fortran and produces as output a structured and modular program written in 
any of these languages. It performs all transformations automatically, although 
the user is given the possibility to specify desired values for the criteria considered, 
since different systems, languages and users may have distinct criteria to define 
what a good module is. The user is also required to assign a meaning to the 
isolated modules. 

Although the last step is not yet fully complete, a first prototype that 
transforms imperative systems written in Pascal into object-oriented ones written in 
Smalltalk has been developed. This paper aims at describing the final step of 
this project. 



3 Object Extraction 

Since the concept of object-oriented programming increases maintainability and 
reusability of systems, the Object-Oriented Conversion Step aims at turning 
a structured and modular imperative system into an object-oriented one. The 
method defined in this paper enhances and combines the two methods described 
in [20]. Its output is a set of potential classes (including the instance variables 
and methods). 

3.1 Object identification basic method 

In a conventional programming language, an "object" can be identified as a 
collection of routines, types, and/or data items [20]. The routines will implement 
the methods associated with the object, the types will structure the data it 
encapsulates or processes and the data items will represent the actual instances 
of the class. 

Each candidate "object" is a triple of sets: 

Candidate Object = (F, T, D) where F is the set of routines, T is the set of 
types and D the set of data items. Any one of these sets can be empty. 

The goal is to find a useful partial classification of routines, types and data 
items, meaningful in the context of the program and its real world domain. 

A large part of the information for this classification can be obtained by 
analysing the relationships among the program components, but carefully 
selected heuristics or human intervention will be needed to eliminate coincidental or 
meaningless relationships. 

Ideally, sets from different objects should not overlap, so that a routine, 
type or data item does not appear in more than one object. However, the 
objects are not required to be completely disjoint, since object identification 
methods should capture those situations in which efficiency considerations 
forced the programmer to transgress good design principles. It is frequent in a 
real program that the original developer or a later maintainer has compromised the 
clarity of the underlying design in some instances, either to gain efficiency or to 
save work. Rejecting candidates that show small overlaps or 
violations of good design practices would unnecessarily reduce the usefulness of 
finding objects. 

In addition, the definition given above does not distinguish clearly between 
the concepts of object class and object. In some cases, it can be easier to find first 
the class and then its instances and in others, the reverse. Therefore, it is more 
convenient to try both together. 

Two approaches to finding objects seem to be useful. The first one is based on 
persistent and global data and establishes links with the routines that manipulate 
them. The second one is based on data types and establishes relationships among 
these types and the routines that use them as formal parameters or return values. 
A detailed description of two methods to implement these approaches can be 
found in [20]. 

Problems of the object identification basic method. No automatic 
method can completely recognise the abstractions implemented in a 
software system. Only an approximation can be produced, in which every 
candidate set constitutes a module that implements an abstraction. Probably, human 
intervention will be needed to improve the quality of the modules, detecting 
undesired relationships among the components of a candidate set [8]. 

Some problems in automatic object identification have been detected and 
analysed. They explain why a totally automatic strategy is unrealistic. 

Problems of the global-based method. Looking for candidate objects implies 
identifying remarkable subgraphs. For example, each set of variables and routines 
associated with an isolated subgraph can be a candidate to implement an 
object. However, an approach based on the search for isolated subgraphs 
only produces satisfactory results for systems with an ideal object-oriented design. 

Probably, for most real-life systems, it produces candidate objects that are too 
big or of low quality. 

Two types of undesired links among subgraphs can be considered [8]: 

• Coincidental connections: caused by routines that implement more than one 
function, and each function corresponds logically to a different object. 

• Spurious connections: caused by routines that implement system specific 
operations that directly access data structures of more than one object. 

In some extreme cases, this method cannot yield several objects. Instead, it 
will give one big object that includes all the program components (routines). 
One reason is that any routine that uses global data of two objects establishes 
a link between them. 
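
The merging effect can be seen in a small sketch that treats routines and global data as nodes of a graph and takes each connected component as a candidate object. This is only an illustration: the `uses` mapping and the routine and variable names are invented, and the method of [20] may build and search its graph differently.

```python
from collections import defaultdict


def candidate_objects(uses):
    """Return connected components as (routines, globals) pairs.

    `uses` maps a routine name to the set of global names it uses or
    modifies directly. Routines touching no global are left out.
    """
    adjacency = defaultdict(set)
    for routine, globals_used in uses.items():
        for g in globals_used:
            adjacency[("routine", routine)].add(("global", g))
            adjacency[("global", g)].add(("routine", routine))

    seen = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        routines, data = set(), set()
        while stack:
            kind, name = stack.pop()
            (routines if kind == "routine" else data).add(name)
            for neighbour in adjacency[(kind, name)]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append((routines, data))
    return components


clean = {
    "push": {"stack"}, "pop": {"stack"},
    "enqueue": {"queue"}, "dequeue": {"queue"},
}
separate = candidate_objects(clean)

# One routine that touches both globals links the two candidates.
merged = candidate_objects({**clean, "dump_all": {"stack", "queue"}})
```

With the four well-behaved routines, two candidates are found. Adding the single routine `dump_all` collapses them into one, which is the situation that calls for heuristics or human intervention.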

A later refinement phase would then be necessary, in which human intervention 
would improve the candidate objects by excluding bad routines 
or data items from the graph. 

A way to detect the degree of relationship between class methods and instance 
variables is proposed. 

For each class, subsets of its methods and instance variables can be 
considered. The following variables: 

• total number of class methods, 

• total number of class instance variables, 

• number of methods in the union of the subsets, 

• number of methods in the intersection of the subsets 

are some of those that could be combined to obtain a coefficient, in 
order to determine if the class should be split. Finding this coefficient is 
a very hard task because there are multiple dimensions to take into account. 
Besides, it would be necessary to keep in mind the semantics of the problem. 

Problems of the type-based method. In many cases, this method can produce 
"too big" objects. Frequently, the culprit is some type that a human can detect 
as irrelevant for the objects that are being identified. 

There can be some false links, originated by some type, which lead to classifying 
all the routines and types involved in one object. In these cases, some heuristics can 
be used to reduce such conflicts (for example, deleting primitive types and so 
on). Furthermore, it is a good opportunity for the user to intervene. In fact, 
the mentioned problem is also found in object-oriented design, where certain 
methods can be considered as belonging to either of two classes of objects. 

In this method, it is assumed that a given type is part of another. But in 
cases in which the types have no relation, the topological ordering presented 
does not completely characterise the relative ordering of all types. 

Therefore, this routine classification will give some routines that can be 
classified under more than one type (and more than one object). 

3.2 The proposed method 

Improvements to the global-based method. The following decisions were 
taken to get better results: 

• Global data are only those variables and constants declared in the main 
program. 

• Before concluding that a global variable or constant is used or modified by a 
routine, it is verified that neither a local variable nor a formal parameter has 
the same name. 

• Only the direct use/modification of global data is considered to detect the 
routines associated with each of them. A global is not considered directly 
used/modified if it is merely an actual parameter of a user-defined routine. 

Improvements to the type-based method. It would be possible to work 
with an outline of an alternative classification of types based on the complexity 
of all the program data types. This outline consists in ordering all the data types 
according to their complexity. That is, if a type x is more complex than a type 
y, type x should be used instead of type y when the routines are classified. This 
complexity is expressed by means of a complexity index function (IC) [22]. 

The type domain size computation defined in this work is different from the 
complexity index function proposed in [22]. Besides, to complete this 
classification, type categories are introduced. 

Types domain size computation. The domain size of each type grows with 
the number of states that it can take, so it is purely syntactic. It is 
computed as follows: 

• primitive type: $C_p = \log_2(\#\text{states})$ (e.g. $C_{boolean} = 1$, $C_{char} = 8$) 

• simple type: $C_s = \log_2(\#\text{states})$ 

• record type: $C_r = \sum_i I_i \cdot (\text{field domain size})_i$, 
where $I_i = 0$ if field $i$ is a pointer to itself, and $I_i = 1$ otherwise. 

• finite collection (e.g. array): 
$C_{fc} = (\text{domain size of its element type}) \cdot (\text{collection size})$ 

• file: $C_{file} = (\text{domain size of its element type})$ 

• infinite collection (e.g. linked list): 
$C_{ic} = (\text{domain size of its element type}) + n$ where $n \to \infty$. It should 
always be a pointer to a dynamic structure, but not a pointer to itself. In this case, 
for example, the pointer type to a linked list is considered but not the linked list 
node type. 

However, this way of computing the domain sizes is not always suitable. This 
is a good moment for the user to modify the type domain sizes he considers 
wrong. For example, if a student name field is declared (in a university system) 
as string[64], its domain size will be 512. But, if the user knows that the system 
can end up containing at most 32000 students, then a more appropriate domain 
size would be 15 (the result of $\log_2 32000$, rounded up). 
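
The computation rules can be sketched as a few Python helpers. This is an illustrative sketch only: the function names and the convention of passing `None` for a field that points to its own record type are assumptions of this example, not part of the tool.

```python
import math


def primitive(states):
    """Primitive or simple type: log2 of the number of states."""
    return math.log2(states)


def record(field_sizes):
    """Record: sum of the field sizes; a self-pointer (None) counts as 0."""
    return sum(size for size in field_sizes if size is not None)


def finite_collection(element_size, length):
    """Array-like type: element size times collection size."""
    return element_size * length


def file_of(element_size):
    """File: domain size of its element type."""
    return element_size


def infinite_collection(element_size):
    """Pointer to a dynamic structure: element size plus an unbounded n."""
    return element_size + math.inf


boolean_size = primitive(2)
char_size = primitive(256)
name_size = finite_collection(char_size, 64)  # string[64]

# User override from the text: at most 32000 distinct students.
adjusted_name_size = math.ceil(math.log2(32000))
```

The purely syntactic value for the name field is 512, while the user-supplied bound gives 15, which shows how much the syntactic measure can overestimate the real domain.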

Types categories classification. Besides, each type will be classified in one of the 
following categories (listed from lower to higher priority): 

• primitive type 

• simple type 

• file type: types from this category are used, in general, for backups. For this 
reason, they have an associated structure in main memory (finite or infinite) on 
which most of the operations will be done; however, in exceptional cases, the 
work is directly done on the file. 

• finite structure 

• infinite structure: it is an infinite collection, or the type category of some 
of its fields (if it is a record) or the type category of its components (if it is a 
collection) is an infinite structure. 

Parameter category classification. In addition, the parameter passing will be 
classified (parameters and return type) in the following way (from lower to higher 
priority): 

• return type 

• parameter passed by value 

• parameter passed by reference which is only used in the routine body 

• parameter passed by reference which is modified in the routine body 

• infinite structure passed as parameter 

Criteria to assign a routine following the type-based method. 

• If the routine has neither parameters nor return type, or if the highest pri- 
ority category between them is either primitive or simple, then it is not assigned 
to any type. 

• Otherwise, it will be assigned in the following way: 

- to the type with the highest priority category; 

- if there is a tie, to the one with the biggest domain size; 

- if there is a tie again, to the one with the highest passage priority. 

If, after applying the previous criteria, a tie between two or more types still 
remains, the routine will be assigned randomly to one of them. 
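
The assignment criteria amount to a lexicographic comparison, which can be sketched as follows. The `Slot` record, the category labels and the numeric priorities are assumptions chosen for this example; only their relative order follows the lists given in the text.

```python
import random
from dataclasses import dataclass

# Priorities, from lower to higher (labels are illustrative).
TYPE_CATEGORY = {"primitive": 0, "simple": 1, "file": 2,
                 "finite": 3, "infinite": 4}
PASSING = {"return": 0, "value": 1, "ref_read": 2,
           "ref_modified": 3, "infinite_param": 4}


@dataclass(frozen=True)
class Slot:
    """One parameter or return value of a routine."""
    type_name: str
    category: str
    domain_size: float
    passing: str


def assign_routine(slots, rng=random):
    """Return the type a routine is assigned to, or None if unassigned."""
    if not slots:
        return None
    if max(TYPE_CATEGORY[s.category] for s in slots) <= TYPE_CATEGORY["simple"]:
        return None

    def key(s):
        # Category first, then domain size, then passage priority.
        return (TYPE_CATEGORY[s.category], s.domain_size, PASSING[s.passing])

    best = max(key(s) for s in slots)
    winners = sorted({s.type_name for s in slots if key(s) == best})
    # Remaining ties are broken randomly, as stated in the criteria.
    return winners[0] if len(winners) == 1 else rng.choice(winners)


by_size = assign_routine([
    Slot("T1", "finite", 105, "value"),
    Slot("T2", "finite", 100, "ref_modified"),
])
by_passing = assign_routine([
    Slot("T1", "finite", 100, "value"),
    Slot("T2", "finite", 100, "ref_modified"),
])
unassigned = assign_routine([Slot("integer", "primitive", 16, "value")])
```

With strict comparison of domain sizes, the first routine goes to `T1` even though `T2` is the parameter that gets modified. When the sizes are equal, the passage priority decides in favour of `T2`.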

It is necessary to highlight that in some cases the assignment obtained using 
the proposed criteria could be improved. 

For example, it would be possible to have two types $T_1$ and $T_2$ belonging to 
the same type category, with domain sizes $c_{T_1} = 105$ and $c_{T_2} = 100$, respectively, 
where the $T_1$ parameter is passed by value and the one corresponding to $T_2$, by 
reference. As previously stated, the routine would be assigned to type 
$T_1$, although type $T_2$ could probably be more convenient. However, to compare 
two type domain sizes, two ranges could be considered ($c_{T_1} = [100..110]$ and 
$c_{T_2} = [95..105]$), assuming they have the same domain size if their intersection 
is not empty. In this case, the routine would be assigned to type $T_2$. 

Another unsatisfactory situation appears when the category of the type is 
primitive or simple, in which case the routine is not assigned, though it is sometimes 
convenient to assign it. For example, it would be suitable to assign to the string 
type a routine that checks whether the first string parameter is alphabetically 
smaller than the second one. 

The output of this method is a class for each type declared by the user with 
at least one associated routine. In addition, a class will arise for each record type 
declared, even without any associated routine, so that its fields can be accessed 
through the corresponding automatically created methods. 



Combination of the methods. Both methods were studied and analysed to 
find their advantages and disadvantages. This suggested that the best way to 
obtain objects from structured and modularised imperative systems is to combine 
these methods. In this combination, global variables, which show the existence 
of a class instance, and data types, which represent in a primitive form the 
object classes in the imperative languages, should be taken into account. The 
most complex part is the definition of such a combination in order to get the best 
results. It was also concluded that, if only source code is going to be considered, 
there is no better way of locating potential objects than the previously mentioned 
ones. 

First of all, the global-based method is applied. The instance variables of each 
one of the classes that emerge by this method are the global data that remain 
grouped. Next, the type-based method is used without taking into account the 
routines assigned to the classes computed by the previous method. The instance 
variables of each one of the classes defined by this method are the fields, if a 
record is being considered. In any other case, as the class is a specialisation of 
some existing class (for example, the Array class), it has no instance variables. 
In either case, the methods of each class are the routines assigned to it. 

However, it is probable that some routines and global variables remain 
unassigned to any object. The routines are those that neither use nor 
modify global variables directly and have no parameters. The global variables 
are those which are neither used nor modified directly in any routine but are 
actual parameters of user-defined routines. 

Then the system class is identified. It is the class that has among its 
methods the routine that was the main program in the imperative source code. An 
instance variable for each class determined by the global method is added to its 
instance variables. One instance variable is also added for each global variable 
remaining unassigned. The unassigned routines are added to its methods. 

Nevertheless, if the routine that was the main program does not appear 
among the methods of any class, the system class is created. As when 
it already exists, an instance variable for each class determined by the global 
method is added to its instance variables. One instance variable is also added for 
each global variable remaining unassigned. Besides, the unassigned routines are 
added to its methods. 

In addition, the instance variable parent is added to each one of the identified 
classes, with exception of the system class. It is a reference to an instance of the 
system class. 

An alternative would have been for the system class to be accessed by other 
classes through inheritance. However, this option might only be viable if multiple 
inheritance were available; if this were not the case, such an inheritance link might 
produce conflicts with other possible parents. For this reason, to reach the stated 
objectives and though less convenient, a client link was selected instead of the 
inheritance link. 

As the system class is the one whose methods include the routine that was
the main program in the source code, it starts the program execution.
In [21] it is called the root class.
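
The identification of the system class and the addition of the parent link can
be sketched as follows. This is only an illustration in Python, not the code of
the prototype: the ExtractedClass structure, the build_system_class function,
and the name System for a newly created class are assumptions made for the
example, and for simplicity every class other than the system class is treated
as one determined by the global method.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedClass:
    """A class recovered from the imperative program (hypothetical representation)."""
    name: str
    instance_vars: list = field(default_factory=list)
    methods: list = field(default_factory=list)


def build_system_class(classes, main_routine, unassigned_routines, unassigned_globals):
    """Identify or create the system class and link every other class to it."""
    # The system class is the one that already owns the former main program...
    system = next((c for c in classes if main_routine in c.methods), None)
    if system is None:
        # ...otherwise a new class is created to hold it ("System" is an assumed name).
        system = ExtractedClass("System", methods=[main_routine])
        classes = classes + [system]
    others = [c for c in classes if c is not system]
    # One instance variable per other class (simplification) and per unassigned global.
    system.instance_vars += [c.name.lower() for c in others]
    system.instance_vars += list(unassigned_globals)
    # Unassigned routines become methods of the system class.
    system.methods += list(unassigned_routines)
    # Every other class gets a client link (parent) back to the system class.
    for c in others:
        c.instance_vars.append("parent")
    return system, classes
```

The client link is represented here simply by the name parent in the list of
instance variables; a real transformation would also have to rewrite the
accesses to the former global variables so that they go through that reference.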



Inheritance. The greatest difficulty so far is the detection of inheritance
among the extracted objects.

There are two ways of using a class: inheriting from it or being its client.
Among all issues in object technology, none causes as much discussion as the
question of when and how to use inheritance [21]. In addition, during the
development of this work a third relation was found: different implementations
of the same object. This arises because in imperative programming there is a
tendency to distribute a data item (object) according to its persistence
(dynamic structures, files, tables and so on).

It is not possible to detect syntactically whether one of these three
relationships is present or whether none of them holds. Furthermore, according
to [2], similarity among routine names and/or data structures is not reliable.
There can be “false homonyms” that perform completely different functions
(or implement different data structures), and “unknown synonyms” that
have the same internal structure and perform exactly the same function. As
a rule, structural similarity (combining aspects of data and control flow) is
not necessarily connected with functional similarity: the same function can have
totally different implementations and the same structure can perform totally
different functions. (A program structure that handles tables can produce
completely different services, depending on the data values stored in its tables.)
Functional similarity checks are carried out with the intervention of human
experts. The case that can be detected most safely is that of different
implementations of the same object, when one corresponds to objects stored in a
main memory structure and the other to objects stored in a secondary one.

After studying many cases and alternatives of imperative code for objects
with and without inheritance, it was concluded that the following elements can
be analysed syntactically: attribute names, attribute types, module
functionality, and the functions and procedures in which attributes appear as
parameters.

It can also be asserted that none of these elements is reliable for detecting
inheritance automatically from structured and modularised program source code,
because syntactic similarity does not necessarily imply semantic similarity.
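
The limits of a purely syntactic comparison can be illustrated with a small
sketch. The Python code below, which is not part of the prototype, scores the
overlap between the attributes of two record types and lists the pairs that
reach a threshold. The function names, the Jaccard measure, and the 0.5
threshold are assumptions chosen for illustration; in line with the
observations above, the result is only a list of candidates for a human expert
to examine, not a detected inheritance relation.

```python
def attribute_overlap(a, b):
    """Jaccard overlap between two collections of (attribute name, type) pairs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidates_for_review(records, threshold=0.5):
    """Return pairs of record names whose attributes overlap enough to show an expert."""
    names = sorted(records)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        # The threshold is an arbitrary, assumed cut-off.
        if attribute_overlap(records[x], records[y]) >= threshold
    ]
```

A high score only says that two records look alike; whether they are related
by inheritance, by a client link, or are two implementations of the same object
still has to be decided by a person.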



The object identification prototype. A prototype that automatically
transforms a Pascal program into a Smalltalk object-oriented system was
implemented. User intervention is allowed in order to improve the results; thus,
the user is considered the intelligent agent, and the prototype helps by
providing simple but computationally intensive services.

To validate and examine the effectiveness of the method, twenty examples
were analysed and transformed. They had approximately 2000 lines and seventy
routines, and they corresponded to three different specifications. The first
specification represents an e-mail system; the second, a net simulation; and
the third, an HTML navigator. These same specifications were also given to
object-oriented designers, who developed the systems from scratch. These
systems were compared with those transformed by the prototype. Once the
results were analysed, it was concluded that the latter not only maintain the
original functionality but also have an acceptable structure.



4 Conclusions 

This prototype shows that the proposed transformation is possible, but its
complete development is a challenging task and some steps need to be enhanced.
The output system is only a “first-cut” object representation that should
subsequently be improved in order to obtain a more suitable object-oriented
model. People working in these areas agree that complete automation of these
processes is not feasible and that human participation is needed, especially
when applying them to real systems. This conclusion leads to a need for software
tools that can assist software engineers in their task. In addition, software
visualisation aids human comprehension by representing structure visually and
eliminating irrelevant details. In this context, the development of an
interactive reengineering toolkit would be really helpful because, in legacy
systems, source code constitutes a rich domain of structural as well as flow
information. This toolkit would have to construct visual representations from
source code automatically and allow direct access to the code in its visual
environment, leaving to the tool the low-level tasks needed to maintain the
original functionality. It would also be interesting to let the user select any
desired granularity at which to view the source code. All these facilities would
have a significant positive effect on program comprehension. In particular, the
tool should assist the user during the restructuring step and, most importantly,
during modularization and object-oriented conversion. It would also be of great
help if the tool assisted the user in modifying the program once the
object-oriented program has been derived.

A formal verification demonstrating that the detection of classes and the
subsequent assignment of attributes and methods maintains the original
functionality has not yet been carried out. However, repeated execution of
the prototype on multiple examples has shown that the extracted objects
can compute the original functionality. Moreover, they are similar to the
objects that might have been built through an object-oriented design.



Abstract. Many organizations have a number of mission-critical systems
that are out-of-date, but that are essential to their activities and
cannot be discontinued. This problem is known as the Legacy System
Dilemma, and it is usually solved by the migration of the existing systems
to a completely new environment. Although there are many strategies
and tools to perform this migration, no methods are available for
evaluating the performance of the new system before its migration has
been completed. This paper presents CAPPLES, a capacity planning
and performance analysis method for the migration of legacy systems.
A real case study is presented where CAPPLES was successfully applied
to predict the behaviour of a new version of a mission-critical legacy
system. Details of how to use CAPPLES, such as the characterisation of
the synthetic workload and the simulation of the new system, are also
provided.



1 Introduction 

Many organisations have a number of mission-critical systems that are
out-of-date and very difficult to maintain [2,6,8], but that cannot be
discontinued. These systems, known as legacy systems, have come to prominence
recently, particularly due to the so-called Year 2000 bug [12]. There are many
ways of minimizing the problems related to legacy systems. However, migrating
these systems to a new environment is usually the most effective approach to
solving such problems. The new systems that originate from this migration are
known as target systems. Several strategies [3,4,6,25] and tools [10,20] have
been proposed to help the migration of legacy systems. For instance, reverse
engineering [9,21,23] is one such strategy.

* This work was supported by COPASA-MG under cooperative agreement 3025.

On the other hand, during the life cycle of a system it is usual to perform
capacity planning and performance analysis studies not only to evaluate the
system’s environment but also to plan its operation [13,26]. This type of study
would also be very useful when planning the migration of a legacy system.
However, traditional capacity planning and performance analysis methods
[11,16,19] cannot cope with some problems concerning the migration of legacy
systems. The first problem is that two systems must be considered: the legacy
system itself and the target system. The second problem is that legacy systems
are usually very large (sometimes with more than one million lines of code) and
their target systems tend to be even larger, which makes the performance
analysis study very complex. Moreover, the time required to perform the study is
in general extremely long, whereas the time available to understand the whole
migration context is very short. Therefore, any effort to understand the
environment of both systems, as required by the traditional methods, would take
more time than is available. A third problem is that the workload
characterisation must be based on information from both systems. The traditional
methods do not explain how such workload characterisation should be carried out.

Thus, although there are several strategies and tools available to migrate
legacy systems, there are no methods to predict the behaviour of the target
systems when the legacy systems are shut down. One can argue that, by adopting
an incremental migration strategy, as proposed by Brodie and Stonebraker [6],
an organisation can reduce the migration risk to a minimum. However, we recall
that incremental migration is only possible for decomposable systems (as defined
in [6]), and that minimal risk is not the same as no risk.

This paper introduces CAPPLES, a CApacity Planning and Performance
analysis method for the migration of LEgacy Systems. The method originated
from a specific need at COPASA-MG, the Water and Sewage Company
of the State of Minas Gerais, Brazil, whose development team had to evaluate
the performance of a target system before it became operational. Accordingly,
this paper presents a real case study where CAPPLES was successfully applied to
predict the behaviour of a non-operational target system of a mission-critical
and operational legacy system.

The remainder of the paper is organised as follows. Section 2 defines some
terms and presents the assumptions considered in this work. Section 3 presents
CAPPLES. Section 4 describes SICOM, the COPASA-MG system that is the
subject of this study. Section 5 describes how CAPPLES was applied to
characterise the synthetic workload used in the simulation of SICOM. The
simulation process is described in Section 6. Section 7 presents some
experimental results. Finally, conclusions are presented in Section 8.



2 Terminology and Migration Assumptions 

In this section, we define some terms and present the migration assumptions
that we consider in this work.

2.1 Terminology

Services. An application can be modelled as a hierarchy of tasks, each one with
a specific goal [17]. Tasks can be decomposed into subtasks, which can be
decomposed into sub-subtasks, and so on. Thus, tasks can be specified at
different levels of abstraction. Services are top-level tasks composed of a set
of conventional actions performed by their subtasks over the system. As a task,
a service also has a specific goal.

On-line transactions. These are subtasks that represent each interaction of the
user with the application through the use of interactive devices, such as
displays, keyboards, and mice. On-line transactions are less abstract tasks than
services. In fact, an on-line service can be decomposed into many on-line
transactions.

Batch jobs. Services not associated with interactive devices are called batch
jobs. Users do not interact directly with services of this category. In fact,
printed reports are the usual output of batch jobs. Further, batch jobs usually
process large amounts of data. Batch jobs can also be called batch services.



2.2 Migration Assumptions 

Two basic assumptions must be considered when migrating a legacy system: 

1. The migration of a legacy system is time-consuming. The migration of legacy 
systems requires planning and preparation. In fact, the resources of the new 
target system environment must be installed and tested. Moreover, technical 
people must be trained to operate these resources, and users must be trained 
to use the target system. Worse, users usually can only be trained after the
technical people are ready to operate the new environment. 

2. Developers know the target system. The target system developers know the 
individual behaviour of the services provided by the target system. Indeed, 
it is expected that these services have been individually tested with success.
Thus, the developers know which services in the target system are the most
likely to present performance problems.

3 The CAPPLES Method 

3.1 The Scope 

CAPPLES, as presented here, is a method general enough to deal with many
different migration scenarios. However, the method has a well-defined scope
within which it can be applied, as described below.

Workloads are generated by services of three categories: (1) interactive
requests, (2) on-line transactions, and (3) batch jobs [19]. Considering this
workload composition, CAPPLES is a method for evaluating the performance of
target systems running services of any category, but where the on-line services
are suspected to have performance problems. In fact, batch jobs rely basically
on the servers’ resources (e.g., CPU, I/O subsystem, operating system).
On-line transactions, however, rely on the whole environment, i.e., they rely on
the servers’ resources, like batch services, but also on client computers’
resources, local networks, remote links, network devices (e.g., routers, hubs,
switches), and local and distributed software (e.g., transaction monitors,
operating systems, network protocols). Therefore, batch job performance
problems tend to be solved more easily than on-line transaction performance
problems. Further, interactive requests are not used intensively in legacy
systems. Indeed, they are usually provided by stand-alone applications to
perform specialised services. Therefore, interactive request performance
problems can be solved in an ad hoc manner.

According to Jain [16], there are three approaches to predicting a system’s 
performance: (1) analytical methods, (2) experimental studies, and (3) simula- 
tions. Considering the problem of migrating legacy systems, the use of analytical 
methods requires many simplifications and assumptions about the target system. 
Thus, the resulting model would possibly not model the target system with accu- 
racy. Experimental studies are also usually not appropriate. Although the target 
system might be fully developed, its operational environment may not be set up. 
Moreover, to achieve the precision needed in this kind of study, the
experiments would have to be carried out several times, requiring the allocation
of many people in order to generate the predicted workload. Worse, it is hard
to guarantee repeatability in this kind of experiment. Therefore, simulations
tend to be the most appropriate approach for such studies. Indeed, simulation
models are very flexible for implementing the details required. Moreover, they 
are flexible enough to be modified, so that many scenarios can be evaluated from 
a validated model. 

3.2 Overview of the Method 

CAPPLES is composed of nine steps, which are described below. Additional
details are presented in Sections 4, 5 and 6, and also in [7].

Step 1: Specification of the measuring time window. Many systems have
an on-line window, in which the on-line transactions have higher priority than
services of other categories [19]. By inspecting the scheduled routines of a
legacy system and its environment (e.g., operating system tuning and transaction
monitor tuning), it is possible to identify the on-line window, which
corresponds to the measuring time window to be used in Step 2. The measuring
time window usually covers weekdays from 9 am to 5 pm, when users are
effectively interacting with the legacy system. However, its specification
depends essentially on the organisation’s activities. For example, the measuring
time window for supermarkets can have extended working hours, including
weekends.
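
A measuring time window of this kind can be expressed as a simple predicate
over timestamps, which is then used to filter the monitor data. The sketch
below is in Python; the function name and its parameters are assumptions made
for illustration, and the default values merely encode the typical weekday
window from 9 am to 5 pm mentioned above.

```python
from datetime import datetime


def in_measuring_window(ts, start_hour=9, end_hour=17, days=(0, 1, 2, 3, 4)):
    """True if timestamp ts falls inside the measuring time window.

    days uses datetime.weekday() numbering (Monday is 0); the defaults encode
    the typical weekday 9 am to 5 pm window and are only an assumption.
    """
    return ts.weekday() in days and start_hour <= ts.hour < end_hour
```

An organisation with extended hours would pass its own values, for example
in_measuring_window(ts, 8, 22, days=range(7)) for a window that includes
weekends.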

Step 2: Measurement of the legacy system and specification of the
simulation time window. In order to collect the data required to evaluate
the performance of the legacy system, system monitors must be chosen and
used throughout the measuring time window. Although it is possible to collect
hundreds of different types of performance data, CAPPLES only requires the
following: the mean response time (MRT) of the legacy system on-line
transactions, to be used for comparison with the simulated MRT of these same
transactions (Step 9), and the frequency of each service during the measuring
time window, summarised by hour (or half-hour, quarter-hour, etc.). These data
help to identify the peak hour (or half-hour, etc.) of the legacy system in
terms of its utilisation. This peak hour is the simulation time window that will
be used in Steps 7 and 9. Two aspects should be observed: (1) the legacy
system can be monitored for as long as possible, but the workload profile tends
to be almost the same week by week (this happens because only high-frequency
services are counted, as described in Step 3); (2) the duration of the
simulation time window should be large enough to accommodate the occurrence of
services that are relevant for the composition of the synthetic workload, as
described in Step 5.
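
The two measures required by this step can be derived from a monitor log with
very little code. The following Python sketch assumes, purely for illustration,
that the log has already been reduced to (hour, service) pairs and to a list of
response times; the function names and the tie-breaking rule are hypothetical
and are not prescribed by CAPPLES.

```python
from collections import Counter


def peak_hour(executions):
    """Return (hour, count) for the busiest hour.

    executions is an iterable of (hour, service) pairs taken from the monitor
    log, already restricted to the measuring time window.
    """
    per_hour = Counter(hour for hour, _service in executions)
    # Ties are broken in favour of the earliest hour (an arbitrary choice).
    return max(sorted(per_hour.items()), key=lambda item: item[1])


def mean_response_time(response_times):
    """Mean response time (MRT) of the measured on-line transactions."""
    response_times = list(response_times)
    return sum(response_times) / len(response_times)
```

The same counting can be done per half-hour or quarter-hour by changing the
key that is extracted from each log record.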

Step 3: Identification of the relevant services in the legacy system.
CAPPLES adopts the assumption that usually 20% of the programs are
responsible for 80% of the system workload [14]. The proportion may not be
exactly 20/80, but the idea is to identify a reduced set of services that are
responsible for a significant part of the system workload. Once again we have a
trade-off between the required precision of the performance prediction and the
time available for performing the study. It is important, however, to observe
that a general understanding of the selected services is required. This is a
very difficult task, since legacy services are usually not well documented.
Based on this trade-off, the services that compose the synthetic workload are
selected, as described in Step 5. It is important to respect the selection
order: a service cannot be selected if it is less frequent than another service
that has not been selected.
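
The selection rule of this step, taking services in decreasing order of
frequency until a chosen share of the workload is covered, can be sketched as
follows. The Python function below is an illustration only: its name, the
dictionary representation of the frequencies, and the default coverage of 0.8
(taken from the 20/80 assumption) are choices made for the example.

```python
def select_relevant_services(frequencies, coverage=0.8):
    """Pick the most frequent services until they cover the requested share.

    frequencies maps a service name to its execution count in the measuring
    time window; coverage is the target fraction of all executions.
    """
    total = sum(frequencies.values())
    selected, covered = [], 0
    # Walk services from most to least frequent, so that a service is never
    # selected ahead of a more frequent one.
    for name, count in sorted(frequencies.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered >= coverage * total:
            break
        selected.append(name)
        covered += count
    return selected, covered / total
```

In practice the coverage target is itself part of the trade-off between
precision and available time, so it would be adjusted rather than fixed.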

Step 4: Mapping of the relevant services from the legacy system to the
target one. In CAPPLES, the intensity of the synthetic workload is provided
by the legacy system and the resource demand of each service is provided by the
target system. Therefore, it is necessary to map the on-line transactions from
the legacy system to the target one. However, a direct on-line transaction
mapping is usually not possible in practice, since it is very difficult to
recognise the correct equivalence between the transactions in both systems.
Thus, in our method, this mapping is carried out indirectly through on-line
services, as shown in Fig. 1. This is possible because on-line services are
higher-level abstractions (see Section 2.1) that encapsulate conventional
actions that must exist in both systems.

Fig. 1. Mapping of on-line transactions.

We notice, however, that mapping on-line transactions into on-line services
in the legacy system is not usually a trivial task. The reason for this
difficulty is that legacy systems often have out-of-date documentation and few
people have a comprehensive knowledge of how they have been implemented. On the
other hand, users know very well how to operate legacy systems and developers
have a complete understanding of the target systems. Therefore, collaboration
between users and developers usually solves the mapping between on-line
services. Additionally, the mapping of on-line services into on-line
transactions in the target system can be solved by the development team.
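
The indirect mapping through services amounts to composing two tables: one
from legacy transactions to services and one from services to target
transactions. The Python sketch below illustrates this composition; the
function name and the dictionary layout are assumptions for the example, and
the two tables themselves would come from the interviews described above.

```python
def map_transactions(legacy_tx_to_service, service_to_target_tx):
    """Map each legacy on-line transaction to target transactions via its service."""
    mapping = {}
    for legacy_tx, service in legacy_tx_to_service.items():
        # A service with no known counterpart is left unmapped (empty list)
        # for users and developers to resolve.
        mapping[legacy_tx] = service_to_target_tx.get(service, [])
    return mapping
```

Keeping unmapped transactions visible as empty entries makes it easy to list
what still has to be clarified with users and developers.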

Step 5: Generation of a synthetic workload for the target system.
At this step the synthetic workload is partially ready, since the intensity of
the services was identified in Step 2 and the composition of the workload was
identified in Step 3. The service demands, such as CPU time, local and
remote network utilisation, and hardware specification, must be taken from the
target system and its environment. The exact composition of the service demands
depends on the system environment considered (e.g., database management
system, local network, remote network), the composition of the workload
(e.g., on-line services, batch jobs, spooler jobs), and also the performance
monitors available in the legacy and target system environments. In fact, once
the composition and intensity of the workload have been identified, the
remaining workload characterisation (and even the remainder of the capacity
planning and performance analysis method) is the same as for the traditional
methods.

Step 6: Modelling the target system. Modelling the target system requires
a precise comprehension of it. Fortunately, the target system developers can
help performance analysts in this task. Using a general-purpose discrete
event simulator (e.g., SES/Workbench [22], SMPL [18]), the target system can
be modelled by describing the characteristics that affect the behaviour of the
on-line transactions. Three important aspects should be observed at this step:
(1) A high level of detail is not usually required when modelling the target
system. Indeed, the best approach is to start with a simple model and increase
the details gradually until the model is considered validated (this is the
calibration process in Step 7); (2) The model must cover all services identified
during the specification of the simulation time window (Step 3). Therefore, if
batch jobs, or even interactive requests, were running during the simulation
time window, they must be considered in the simulation; (3) Every resource that
can be temporarily blocked by a service (e.g., database pages [1], LANs,
printers) or that has a high probability of being saturated (e.g., remote links
with small bandwidth) is a candidate to be modelled.

Step 7: Calibration and validation of the target system model. As
CAPPLES focuses on on-line transactions, the model validation is based on the
MRT of the on-line transactions. Therefore, a simulation model is considered
validated if the MRT of the modelled on-line transactions corresponds to the
measured MRT of the same transactions in the target system. It is important to
observe that the workload must be a relevant one, and that the measured
environment must be as complex as the real target system environment. This
environment, however, does not need to have the same size as the real
environment. Indeed, a synthetic workload that is small in terms of active users
can be larger than real workloads, since the synthetic workload intensity can be
much larger than the real workload intensity. To achieve this result, we just
need to set the virtual user think time as close as possible to the response
time.
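
Two points of this step can be made concrete with a short sketch. The first
function applies the interactive response time law, X = N / (R + Z), a standard
operational law that is not stated in the text above, to show how a few virtual
users with a short think time can generate the same transaction rate as many
real users. The second function is a hypothetical validation check: the 10%
relative tolerance is an assumption, since no acceptance threshold is given
here. Both function names are invented for this Python illustration.

```python
def throughput(users, response_time, think_time):
    """Interactive response time law: X = N / (R + Z), in transactions per second."""
    return users / (response_time + think_time)


def model_is_validated(simulated_mrt, measured_mrt, tolerance=0.10):
    """True if the simulated MRT is within a relative tolerance of the measured MRT.

    The default tolerance of 10% is an assumed value, not one given by the method.
    """
    return abs(simulated_mrt - measured_mrt) <= tolerance * measured_mrt
```

For example, under this law 200 users with a 2 s response time and a 38 s
think time generate 5 transactions per second, and so do 20 virtual users with
a 2 s think time, provided the response time stays at 2 s. These figures are
illustrative only.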

Step 8: Workload prediction. The workload generated in Step 5 corresponds
to the current demand of the organisation. In this step, this workload should
be adjusted to reflect the moment when the target system will be in
charge of all the services. Workload forecasting techniques that take into
account the business units of the organisation can be used [19].

Step 9: Target system performance prediction. Once the predicted
workload has been designed (Step 8), it can be submitted to the validated
simulation model (Step 7), generating all the required information concerning
the behaviour of the target system. The advantage of using a simulation model is
that it can be easily modified, so that different scenarios can be evaluated.

In the next section we discuss the use of CAPPLES to predict the behaviour 
of a real target system. 

4 SICOM: A Case Study for CAPPLES 

CAPPLES was used to predict the behaviour of a target system, called SICOM,
which was developed to replace the commercial system that has been operating at
COPASA-MG since 1985. This commercial system, called HP07, is a mainframe-based
legacy system composed of about 3,500 programs written in COBOL and Natural.
As a legacy system, HP07 was no longer fulfilling the organisation’s
needs. Thus, a migration effort was initiated to develop a completely new
system, SICOM.

After five years of development, SICOM was ready to replace HP07. However,
as this system is mission-critical for COPASA-MG, its replacement had to be
carried out carefully. In fact, HP07 is mission-critical since it is in charge
of all commercial tasks and some of the operational tasks of an organisation
responsible for supplying water to more than 5 million customers. As the
organisation is fully adapted to work according to the response times of the
legacy system, CAPPLES was used to predict the response times of SICOM’s
on-line transactions. In this way, the expected response times of the
target system were known before the migration took place, so that the
organisation could, if needed, adjust its services to work according to the
predicted response times.

SICOM has 1,291,087 lines of code, divided into about 6,500 programs, and was
entirely developed in Natural using the ADABAS database management system
[24]. It was developed to operate in a distributed environment. In contrast with
HP07, which processed data from the whole state of Minas Gerais on a centralised
mainframe, SICOM will operate in a distributed environment in seven distinct
regions of Minas Gerais. In this study, we have analysed the behaviour of SICOM
in the Northern Region’s environment. However, the results of this work can
be extended to the other regions in a simple manner. The Northern Region is
divided into three districts named Montes Claros, Janaúba, and Januária. SICOM
will operate in Montes Claros, and users from Janaúba and Januária will use the
system through telnet sessions. The three districts are interconnected using
X.25 links. Fig. 2 illustrates this operational environment.



Fig. 2. The Northern Region environment.



5 Workload Characterisation 

In this section, we describe how CAPPLES was used to predict the workload that
will be submitted to SICOM. Each of the following subsections relates to a
specific step of CAPPLES.

5.1 The Measuring Time Window 

The first issue to address in CAPPLES is the specification of a measuring time
window in which the system will be analysed. The measuring time window
specified for SICOM corresponds to the commercial hours of COPASA-MG, which
include all weekdays from 8 am to 6 pm, plus two extra hours, 7 am to 8 am
and 6 pm to 7 pm. These extra hours were added because of the number of on-line
services executed during them, as reported by IBM CICS/MVS [15], the
on-line transaction monitor in use.



5.2 Measures Taken from the Legacy System 

Once the measuring time window is defined, the legacy system must be moni- 
tored. In CAPPLES, two measures are of vital importance: (1) the frequency of 
execution of the legacy system programs and (2) the average response time of 
these programs. 

Another piece of information considered in this case study, mainly due to the
distributed architecture of SICOM, was the terminal from which each program
was executed. This information allowed us to identify the data belonging to
the Northern Region and, therefore, to use them to predict the workload for that
region. As we assume that the legacy system is still operational, these measures
can be easily taken using any available on-line transaction monitor. For
instance, in our case study we used the CICS Manager monitor [5] to collect
these performance measures.

5.3 Identification of the Relevant Services in the Legacy System 

Having identified the most frequently executed programs, it is necessary to find 
out which services these programs are related to. Usually, this can be accom- 
plished through interviews with users of the legacy system. Alternatively, the 
source code of the legacy system can be analysed. 

Suppose, for instance, that the programs PT78 and PT88 were the most
frequently executed programs in HP07, and that analysing the source code reveals
that these programs refer to, for example, the “order entry” service. In this
case, the frequency of execution of the programs PT78 and PT88 is about the same
as the frequency with which an order entry is executed in the system. In this
way, we can find out the frequency of execution of the most relevant services in
the legacy system.

An important issue to address in this step is the number of services that must
be analysed and modelled with CAPPLES. This will vary from system to system.
In this case study, we used 22 services, corresponding to about 30% of the
total processing activity of all systems run on the COPASA-MG mainframe.
In terms of the HP07 system, these 22 services correspond to about 70% of the
use of this system. Adding more services would not be helpful, because they
would contribute little to the overall statistic. We observed that including
the next most executed service (i.e., the 23rd) would increase the represented
share of the overall mainframe usage by only 0.6%.

5.4 Mapping of Relevant Services 

Once the most relevant services and their execution frequency are determined, it
is necessary to map these services from the legacy system to the target one. As
these services are high-level abstractions of the real system’s transactions
(e.g., an order entry in the system), this mapping can be easily accomplished
through interviews with the developers of the target system.

At this point, after the complete mapping of the relevant services, we have 
the frequency of execution of these services in the target system. However, as 
the target system might have added new functions, it is difficult to find out how 
these services will be executed in the new system. Possibly there are different 
ways of performing the same service in the target system (we call them use 
cases). Therefore, it is necessary to analyse every possible use case in the target 
system. Further, we need to distribute the frequency of execution of the services 
between these use cases. Sometimes the data measured in the legacy system can 
help at this step. 
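
Distributing the frequency of a service among its use cases is a proportional
split. The Python sketch below illustrates it; the function name is invented,
and the weights are assumed to come from the developers or, where available,
from the legacy measurements, since no particular rule is prescribed here.

```python
def distribute_frequency(service_frequency, use_case_weights):
    """Split a service's execution frequency among its use cases.

    use_case_weights maps each use case to a non-negative weight; the weights
    are normalised, so they need not sum to one.
    """
    total_weight = sum(use_case_weights.values())
    return {
        use_case: service_frequency * weight / total_weight
        for use_case, weight in use_case_weights.items()
    }
```

With equal weights the frequency is simply divided evenly, which is a
reasonable starting point when nothing is known about how users will choose
among the use cases.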

In the case study, we went from 22 services to 43 use cases in the target
system. Fig. 3 illustrates the process of identifying and mapping services from
the legacy to the target system.

[Figure labels: measures taken from the legacy system; legacy system; target
system; workload]

Fig. 3. The workload characterisation process.

5.5 Generating the Synthetic Workload 

In order to generate the workload that will be submitted to the target system, 
two characteristics must be determined: its intensity and its services’ demands. 
Up to this point, the intensity of the workload has already been defined by the 
prior steps of CAPPLES. 

The measurement of the services’ demands of the relevant services can be 
accomplished in a straightforward manner, since we assume that the target system 
is fully developed. It is simply a matter of using accounting tools of the operating 
system (e.g., acct, entstat) for measuring the services’ demands while the target 
system is running. 



6 Simulation Model 

A general purpose simulation tool, SES/Workbench [22], was used to build the 
simulation model. The following components of the target system were modelled 
(see Fig. 4): 

Workload. This corresponds to the predicted workload intensity and services’ 
demands generated with CAPPLES. 

Natural Subsystem. Each program present in the workload model competes 
for some system’s resources (e.g., CPU, LANs). The Natural Subsystem rep- 
resents this use of resources by the applications. 

ADABAS Subsystem. The ADABAS DBMS was modelled since it uses CPU 
cycles and makes accesses to the I/O subsystem. 

LANs and WANs. Ethernet and X.25 models were also created since they 
can be sources of contention and increase the response time of the system. 

CPU and I/O Subsystem. The system’s CPU and I/O subsystem were 
the main resources modelled. We assumed that enough main memory was 
available, so it would not influence the system’s performance. 

Initially, the workload component generates transactions following the 
predicted workload. (The term transaction, here, refers to simulation 
transactions, which can be considered as threads in the simulator.) These 
transactions, as they enter the Natural Subsystem model, consume CPU cycles and 
use the ADABAS subsystem model to make the I/O subsystem accesses. The ADABAS 
subsystem also uses some CPU cycles. Finally, the transactions travel through 
the LAN and WAN models and terminate. The cost and quantity of these CPU and 
I/O subsystem accesses are defined by measures taken from the target system, as 
considered by Step 5 of CAPPLES. 

Fig. 4 gives a high level illustration of the model developed and includes only 
its main components. However, batch jobs and ftp services were also modelled, 
since they were identified in the simulation time window. An in-depth description 
of the whole simulation model developed can be found in [7]. 



[Figure 4 diagram not reproduced. Surviving label: CPU Subsystem.]




Fig. 4. The top-level diagram of the SICOM simulation model. 

In order to validate the simulation model built and make sure it represented 
the real system accurately, some experimental results were compared to the 
simulation output. 

The experiment consisted in measuring the response time of SICOM when 
exposed to different synthetic workloads. Three sets of measures were taken, rep- 
resenting workloads generated by 8, 14 and 24 simultaneous clients. In order to 
measure the response times of the on-line transactions, three possible approaches 
were considered: 

Modification of the SICOM source code. The target system source code 
could be modified to register the system response times. The drawback in 
this case is that the network time is not measured. 

Insertion of a net spy. Another choice is to insert a network spy that receives 
the client requests and forwards them to the server, and vice versa, saving the 
response times in the process. In this approach, new network packets are 
created, generating considerable system overhead. 

Using a modified telnet client. In this solution, a modified telnet client 
generates the workload and saves the send request time. When the answer 
is received, the response time is computed and logged in the memory of the 
client computers. This was considered the best approach, since it mimics the 
real world actions in a repeatable manner. 

Footnote (ftp services): This new category was created since ftp services are a 
mixture of on-line and batch services, in terms of performance demands, as found 
in SICOM. 

Using the third approach, we compared the experimental results with the 
simulation output. After four significant modifications in the simulation model 
(and others not so significant), the difference between the measured and 
simulated results was no higher than 5%. 
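
The acceptance criterion used in this validation can be written down as a small check. The sketch below assumes that the comparison is a relative difference per workload; the response times are hypothetical placeholders, not the values measured on SICOM.

```python
def relative_difference(measured, simulated):
    """Relative deviation of the simulated value from the measured one."""
    return abs(simulated - measured) / measured


def model_is_valid(pairs, tolerance=0.05):
    """True if every (measured, simulated) pair differs by at most `tolerance`."""
    return all(relative_difference(m, s) <= tolerance for m, s in pairs)


# Hypothetical mean response times (ms) for the 8, 14 and 24 client workloads.
pairs = [(500.0, 515.0), (640.0, 620.0), (900.0, 940.0)]
print(model_is_valid(pairs))
```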

The experiment carried out for the simulation validation was one of the 
most time demanding tasks in the whole study, mainly due to difficulties in 
generating the workload and logging the response times. This shows that the 
use of simulation instead of experimental methods is really a good choice. 

7 Experimental Results 

In this section, we describe the results of two experiments carried out using the 
validated target system model. The results of the first experiment show that 
the target system can cope with its expected workload. The second experiment 
produced a target system profile in terms of resource utilisation. In both exper- 
iments, the confidence interval was built upon the on-line transaction’s average 
response time. All of our results are expressed as a mean value plus or minus a 
few percentage points, since we built confidence intervals of at least 90%. 
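
For illustration, a mean value "plus or minus a few percentage points" at a 90% confidence level can be obtained as sketched below. The text does not say how the intervals were computed; the normal approximation and the sample values used here are assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(samples, confidence=0.90):
    """Return (mean, half-width) of a two-sided interval, normal approximation."""
    n = len(samples)
    centre = mean(samples)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.645 for 90%
    half_width = z * stdev(samples) / sqrt(n)
    return centre, half_width


# Hypothetical mean response times (ms) from independent simulation runs.
runs = [600.0, 615.0, 605.0, 620.0, 610.0, 598.0, 612.0, 608.0]
centre, half_width = confidence_interval(runs)
print(f"{centre:.1f} ms +/- {100 * half_width / centre:.2f}%")
```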

The results of the first experiment show that the target system will work, 
provided that its response time is smaller than the response time of the legacy 
system. Indeed, organisations are adapted to deal with the response time of the 
legacy systems (e.g., number of concurrent users, number of required transactions 
per hour, etc.) and a significant increase of the response time could require 
changes in the organisation that they cannot cope with. 

The graph of Fig. 5 (a) shows the simulated mean response time of the 
target system predicted by year. The graph also shows that the simulated mean 
response time of the target system will be 612.07 ms in 1999, which is smaller 
than the 960.00 ms of the legacy system. This means that, in general, the target 
system is a real improvement in terms of performance. The mean response time, 
however, is not enough to know if we can shut down the legacy system, since 
essential transactions could have unacceptable response times, e.g., 10 or more 
minutes. As we work with a simulation model, we can trace the response time 
of all transactions, producing a histogram, as shown in Fig. 5 (b). 
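
The histogram of Fig. 5 (b) can be read cumulatively to answer questions such as "what share of the transactions finishes within a given limit". The sketch below uses the percentages of Fig. 5 (b); the helper name is an assumption.

```python
# (upper bound of the range in ms, percentage of transactions), from Fig. 5 (b).
HISTOGRAM = [(100, 8.10), (300, 31.55), (500, 18.78), (900, 18.05),
             (1300, 11.47), (2000, 8.46), (3000, 2.94), (float("inf"), 0.64)]


def share_within(limit_ms, histogram=HISTOGRAM):
    """Percentage of transactions in ranges whose upper bound is <= limit_ms."""
    return sum(pct for upper, pct in histogram if upper <= limit_ms)


print(round(share_within(900), 2))          # share finishing within 900 ms
print(round(100 - share_within(3000), 2))   # share slower than 3000 ms
```

Read this way, roughly three quarters of the transactions finish within 900 ms, and well under 1% take longer than 3000 ms.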

Having solved the main problem, the simulation model can provide additional 
information about the behaviour of the target system, such as the resource 
utilisation shown in Table 1. There, we notice that the resources allocated to the 
target system are far from being saturated. However, some attention should be 
paid to the database server’s CPU, since this is the resource with the highest 
level of utilisation. Other resources, such as local networks and the database 
server’s I/O subsystem, are extremely lightly loaded. 

(a) [Graph not reproduced: simulated mean response time of the target system by year.]

(b) Histogram of the transactions’ response times:

Response Time Range (ms)        Percentage (%)
from          to
0.000         100.000            8.10
100.000       300.000           31.55
300.000       500.000           18.78
500.000       900.000           18.05
900.000       1300.000          11.47
1300.000      2000.000           8.46
2000.000      3000.000           2.94
3000.000      +infinity          0.64

Fig. 5. Mean response time (a) and histogram (b) of SICOM’s transactions. 

Resource                                             Utilisation (%)
Database server’s CPU                                48.184
Database server’s I/O subsystem                       1.557
Montes Claros local network                           0.071
Januaria’s local network                              0.011
Janauba’s local network                               0.007
Remote link between Montes Claros and Januaria       20.347
Remote link between Montes Claros and Janauba        18.882

Table 1. Resources utilisation. 

This resource utilisation table is only a snapshot of the target system 
environment, at migration time. As the utilisation of computer resources is not 
linear, the simulation model must be used to predict the behaviour of the target 
system after the migration. 

A third experiment was carried out to compare the performance of the target 
system with its hypothetical client-server version, showing the benefits of the 
migration from the multi-user system to a client-server environment. Details of 
the modified version of the target system model and the results achieved are 
available in [7]. 



8 Conclusions 

This paper presented CAPPLES, a method for evaluating the performance of 
target systems during the migration of legacy systems. The method has been 
shown to be effective in practice since it was successfully used to evaluate a 
target system with more than one million lines of code, as described in the case 
study presented here. 

The performance analysis of the case study took 8 months, which is an 
acceptable time since the development of the target system took 5 years and 
the planning for its migration 10 months. In fact, the performance analysis took 
place in parallel with the migration planning. Part of these 8 months was 
invested in enhancing CAPPLES. Thus, if we had been able to apply CAPPLES from 
the very beginning, we believe that the evaluation time could have been reduced 
to around 6 months. Moreover, as a significant part of this time was spent 
designing, calibrating, and validating the target system model, having the 
support of a simulation specialist would reduce this time even further. 

As we can see from our case study, using a method like CAPPLES is key to 
the success of the migration process, since performance problems can be 
identified and solved before the target system becomes entirely operational. 
Moreover, the target system environment can be better specified, and 
computational resource requirements more accurately assessed. 

Finally, carrying out a performance analysis study during the migration of 
a legacy system is important not only for managers, but also for developers 
and users. Indeed, after the first year of development of a new system, the 
idea that something is going wrong can threaten the project. After spending 
5 years (or even more) developing the target system, even the developers lose 
their confidence in the performance of the new system. Therefore, such a study 
is important to motivate everyone involved in the migration of a legacy system, 
as well as to provide the organisation with accurate feedback on the inherently 
complex problems concerning the migration of legacy systems. In the case of 
SICOM, this study showed that the system’s environment had been well dimensioned 
and that, apart from the database server’s CPU, the resources allocated to the 
target system are far from being saturated. 
Abstract. Reuse of available databases can support database design 
and reverse-engineering of databases by allowing design decisions to be 
derived from existing databases. 

This article proposes a method for reusing databases similar to the 
approach used in case-based reasoning. Similar databases, or similar parts 
of databases, are first determined. We then discuss the information to be 
reused and how it can be validated. Two methods for building libraries 
are suggested for use in this process. 



1 Motivation 

Database design is the process of determining the structure, semantics, and 
behavioral specifications of a database. For this process, a designer can only 
use informal descriptions of the application. This makes database design quite 
difficult and time-consuming, and the results often depend on the designer’s 
creativity and skill. However, the design process is crucial because the 
usability of a database depends on its design. 

Re-engineering of a database consists of a reverse-engineering process and 
a design process. During the design process, the derived conceptual schema is 
evolved, and the problems that arise are the same as those in database design. 

It is desirable to support the database design process with tools to check de- 
sign decisions or suggest improvements. Because of the abstraction process nec- 
essary to design a database, it is difficult to derive meaningful suggestions auto- 
matically. Reuse of existing databases can improve each of the tasks of database 
design. This article presents a method supporting the reuse of databases. It is 
organized as follows: 

The overview presented in the next section outlines the main tasks of the 
reuse approach. In section 3, related work is reviewed. Section 4 presents a 
method for finding similar databases. In section 5, we derive design decisions from 
similar databases and adapt these to an actual database. Section 6 discusses 
the necessity of a revise process. We then present two methods for organizing 
libraries for supporting the reuse process, and end with a summary. 

2 Overview of the method 

It is widely accepted that the case-based reasoning approach requires the follow- 
ing four tasks (originally suggested in [2]): 
RETRIEVE the most similar case 

REUSE the information and knowledge in that case to solve the problem 

REVISE the proposed solution 

RETAIN the information likely to be useful for future problem solving 

The key idea in reusing databases is similar to the idea behind case-based 
reasoning. We can adapt its tasks to the reuse of database design decisions, 
thereby resulting in the following tasks: 

RETRIEVE the most similar database or part of a database 

REUSE design decisions for the actual database 

REVISE the proposed design decision 

RETAIN the information (e.g. build libraries suited to support the reuse) 

Our approach assumes the following scenario. A set of existing databases is 
available, in addition to an actual database with incomplete structural, semantic, 
and behavioral information. We want to complete the database design. Therefore, 
we search for a similar database among the set of existing databases in order to 
adapt design decisions. 

Before we present our method, we provide an overview of some related work. 



3 Related work 

Storey/Chiang/Dey/Goldstein/Sundaresan: Database Design with Common 
Sense Reasoning and Learning. [9] suggested a system supporting database 
design by using available databases. The approach emphasized determining similar 
pairs of attributes, entities, relationships, and applications. For this 
comparison, name information and an ontology that aids in determining more 
complicated similarities, such as synonyms, were exploited. Furthermore, a 
learning step was realized that derives commonly valid databases for different 
applications. These databases were then used to support the design of new 
databases. This method was tested with sample databases from well-known 
database literature. 

Castano/DeAntonellis et al: Schema Indexing, Clustering, Determining of Sim- 
ilar Databases. Several techniques relevant for reusing databases are described 
in numerous publications. [6], [4], and [5] suggested methods for determining the 
most important parts of a database (i.e. schema descriptors) by exploiting the 
number of paths, the number of attributes, and the hierarchy level of an object. 
The similarity between schemas was calculated by comparing schema descrip- 
tors ([6]) and by comparing all objects of the databases ([5]). Schema abstraction 
based on schema similarity was described in [4] and [5]. 

Song/Johannesson/Bubenko: Finding Similarities for Schema Integration. 
[10] deals with the problem of finding semantic similarities as a prerequisite for 
integrating schemas. The authors compared the meaning of entities and 
relationships by using "integration knowledge" containing information about 
synonyms and subset relationships. Attributes, key attributes, and cardinality 
constraints were also used to compare meanings of entities and relationships. 
This resulted in equivalent, compatible, and mergeable schemas for use in the 
integration process. 

Bergmann/Eisenecker: Reuse of Object-oriented Software. In [3] the reuse of 
object-oriented software is realized as a type of case-based reasoning. The authors 
discovered that a method for determining similarities based solely on structural 
characteristics (names of methods, number and classes of parameters, and return 
value) returned poorer results than a method based on structural and semantic 
information. 



4 Retrieve 

If we want to reuse database design decisions, we first have to identify similar 
databases. This is a demanding task because databases are often complex and 
difficult to understand. We can only exploit available database characteristics 
(e.g. names, types, integrity constraints, transactions) for this task. 

The process of comparing parts of databases is complex and is best realized 
using a bottom-up approach, thereby basing comparisons of complex concepts on 
comparisons of simpler concepts. Our approach begins with methods for finding 
similar attributes in two databases. 



4.1 Determining similar attributes 

This section presents heuristics for finding similar attributes. For each heuristic, 
a similarity function returns a value between 0 and 1 (0 - no similarities, 
1 - equality). The following database characteristics can be compared to find 
similar attributes: 

Ha1 Attribute names (same names, same substrings in names, or synonyms) 

Ha2 Attribute types and lengths (same or similar) 

Ha3 Further structural information (e.g. enumeration types, default values) 

These types of structural information suggest similar attributes. Neverthe- 
less, their use will not determine all similarities because several homonyms and 
synonyms exist between two databases. [10], [9], and [5] assume that synonyms 
are exploitable. Synonyms are domain-dependent, making it impossible to use 
a synonym dictionary for delivering correct results in every case. Although we 
believe that structural information aids in comparing databases, it is beneficial 
to include additional types of available characteristics. 
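
A name heuristic in the spirit of the first rule (attribute names) might look like the sketch below. The scoring values (0.9 for synonyms, a length ratio for substrings) and the synonym set are assumptions chosen for illustration; the text only requires a result between 0 and 1.

```python
def name_similarity(name_a, name_b, synonyms=frozenset()):
    """Score two attribute names in [0, 1]: 1 for equal names, less otherwise.

    synonyms: set of frozenset({name1, name2}) pairs, domain-dependent and
    supplied by the user (hypothetical structure).
    """
    a, b = name_a.lower(), name_b.lower()
    if a == b:
        return 1.0
    if frozenset((a, b)) in synonyms:
        return 0.9  # assumed score for a known synonym pair
    if a in b or b in a:
        # One name is a substring of the other: score by the length ratio.
        return min(len(a), len(b)) / max(len(a), len(b))
    return 0.0


domain_synonyms = {frozenset(("surname", "lastname"))}
print(name_similarity("Name", "name"))
print(name_similarity("student_name", "name"))
print(name_similarity("surname", "lastname", domain_synonyms))
```

Keeping the synonym set as a parameter reflects the observation that synonyms are domain-dependent.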

When integrity constraints are already specified in the actual database and 
data are available, we can enumerate further heuristics: 

Ha4 Keys (We determine if two attributes, A and B, are keys of their entities 
or relationships). 

Ha5 Functional dependencies (If two attributes, A and B, appear to be 
similar, and the same functional dependencies are defined on these attributes, 
then this is an additional hint for similarity.) 

Ha6 Data (same data values of two attributes) 

If further characteristics of a database (e.g. behavioral information, transac- 
tions) are available, additional heuristics for comparing this information can be 
developed. 

We now have some heuristic rules for indicating similar attributes. We 
subsequently compare and weight these heuristics. The more heuristic rules are 
fulfilled, the greater the similarity measure should be. The following simple 
estimation can be used for this task: 

\[
sim(A, B) := \sum_{i=1}^{6} w_i \cdot Ha_i(A, B)
\]
\[
Ha_i \in [0, 1] \text{ (result of heuristic rule } i\text{)}, \qquad
w_i \in [0, 1], \quad w_1 + \dots + w_6 = 1
\]



The weights Wi can be determined using the following table specifying the 
reliability of the results of the enumerated heuristics. The more reliable heuristics 
shall be weighted higher than the less reliable rules. 



very reliable                                    Ha1, Ha6
reliable                                         Ha2, Ha3
relevant only in combination with other rules    Ha4, Ha5
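
The weighted estimation can be sketched as follows. The concrete weights are assumptions: they merely respect the reliability ordering of the table (higher weights for the very reliable rules) and sum to 1.

```python
# Assumed weights: very reliable > reliable > relevant only in combination.
WEIGHTS = {"Ha1": 0.25, "Ha6": 0.25, "Ha2": 0.15, "Ha3": 0.15,
           "Ha4": 0.10, "Ha5": 0.10}


def attribute_similarity(scores, weights=WEIGHTS):
    """Weighted sum of the heuristic results; every score must lie in [0, 1]."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    for rule in weights:
        if not 0.0 <= scores[rule] <= 1.0:
            raise ValueError(f"score of {rule} is outside [0, 1]")
    return sum(weights[rule] * scores[rule] for rule in weights)


# Hypothetical heuristic results for one attribute pair.
scores = {"Ha1": 1.0, "Ha2": 1.0, "Ha3": 0.0, "Ha4": 1.0, "Ha5": 0.0, "Ha6": 0.5}
print(attribute_similarity(scores))
```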



The similarity of an attribute set can be estimated in the following way: 

\[
sim(A_1..A_n, B_1..B_m) := \frac{2 \cdot \sum_{i} sim(A_i, B_j)}{n + m}
\]
\[
j \in 1..m, \text{ every } j \text{ occurs only once}
\]
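
A small sketch of this estimation is given below. The formula only requires that every attribute of the second set is used at most once; choosing the pairing with the highest total similarity, and finding it by brute force, are assumptions made here for illustration.

```python
from itertools import permutations


def attribute_set_similarity(attrs_a, attrs_b, sim):
    """2 * (best sum of pairwise similarities) / (n + m), each attribute used once."""
    n, m = len(attrs_a), len(attrs_b)
    if n == 0 or m == 0:
        return 0.0
    # Let the shorter list choose distinct partners in the longer one.
    flip = n > m
    short, longer = (attrs_b, attrs_a) if flip else (attrs_a, attrs_b)
    best = 0.0
    for partners in permutations(longer, len(short)):
        total = sum(sim(p, s) if flip else sim(s, p)
                    for s, p in zip(short, partners))
        best = max(best, total)
    return 2.0 * best / (n + m)


def exact_match(a, b):
    """Toy pairwise similarity: 1 for equal names, 0 otherwise."""
    return 1.0 if a == b else 0.0


print(attribute_set_similarity(["name", "address", "birthdate"],
                               ["name", "address"], exact_match))
```

The brute-force search is only practical for small attribute sets; it is used here to keep the example self-contained.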



Based on these similar attribute sets, we begin the search for similar entities. 



4.2 Determining similar entities 

When searching for similar entities (E1, E2) in two databases (D1, D2), we 
employ a method based on rules for determining similar attributes (A11..A1n, 
A21..A2m) of the entities. Moreover, there are additional entity characteristics 
that can be included: 

He1 Entity names (same names, substrings in entity names, or synonyms) 
He2 Keys (same or different key attributes of two entities, E1 and E2) 

Both heuristics deliver very reliable results, and can be assigned the same 
weight. These heuristics can be used in estimating similarity measures for enti- 
ties: 
\[
sim(E_1, E_2) := \frac{1}{2} \sum_{i=1}^{2} w_i \cdot He_i(E_1, E_2)
                 + \frac{1}{2}\, sim(A_{11}..A_{1n}, A_{21}..A_{2m})
\]
\[
He_i \in [0, 1] \text{ (result of heuristic rule } i\text{)}, \qquad
w_i \in [0, 1], \quad w_1 + w_2 = 1
\]

In this manner, the estimation of similar attribute sets enters into the calcu- 
lation. 
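
The following sketch combines the two entity heuristics with the attribute set similarity. It assumes that both blocks enter with a coefficient of 1/2 and that the two heuristics are weighted equally, as the text suggests; the input values are hypothetical.

```python
def entity_similarity(he1, he2, attribute_set_sim, w1=0.5, w2=0.5):
    """Combine name (he1) and key (he2) heuristics with attribute set similarity.

    All inputs lie in [0, 1]; w1 + w2 must be 1. The outer coefficients of 1/2
    are an assumption of this sketch.
    """
    if abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("w1 + w2 must be 1")
    return 0.5 * (w1 * he1 + w2 * he2) + 0.5 * attribute_set_sim


# Same entity name, different keys, attribute sets that are 80% similar.
print(entity_similarity(he1=1.0, he2=0.0, attribute_set_sim=0.8))
```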



4.3 Determining similar relationships 

When searching for similar relationships (R1, R2) in two databases (D1, D2), we 
can use the rules for determining similar entities (E11, E12 and E21, E22) and 
similar attributes of the relationships (A11..A1n, A21..A2m). Moreover, there are 
additional characteristics of the relationships that can be included: 

Hr1 Relationship names (same names, substrings, and synonyms) 

Hr2 Keys (same or different keys of two relationships, R1 and R2) 

Hr3 Same inclusion and exclusion dependencies 

Hr4 Cardinalities (when determining a similarity measure we include the 
similarity of entities. Additionally, we compare the associated cardinalities.) 

The following table specifies the reliability of the results when using these 
heuristic rules. The weights Wi of the rules are derived from this overview: 



very reliable                                    Hr1, Hr2
relevant only in combination with other rules    Hr3, Hr4



The similarity of two relationships can be estimated in the following way: 

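\[
sim(R_1, R_2) := \frac{1}{4} \sum_{i=1}^{3} w_i \cdot Hr_i(R_1, R_2)
  + \frac{1}{4}\, sim(A_{11}..A_{1n}, A_{21}..A_{2m})
\]
\[
  \qquad + \frac{1}{8}\, sim(E_{11}, E_{21}) + \frac{1}{8}\, Hr_4(R_1, E_{11}, R_2, E_{21})
  + \frac{1}{8}\, sim(E_{12}, E_{22}) + \frac{1}{8}\, Hr_4(R_1, E_{12}, R_2, E_{22})
\]
\[
Hr_i \in [0, 1] \text{ (result of heuristic rule } i\text{)}, \qquad
w_i \in [0, 1], \quad w_1 + w_2 + w_3 = 1
\]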



In this estimation, there are two possibilities for comparing the associated 
entities: E11 - E21 and E12 - E22, or E11 - E22 and E12 - E21. The one with the 
highest similarity measure is chosen. 
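
The choice between the two pairings can be sketched as below. The coefficients (1/4 for the relationship's own rules and attributes, 1/8 per entity term) are assumptions, as are the callbacks `entity_sim` and `cardinality_sim`, which stand in for the entity similarity and the cardinality heuristic, and the entity name Lecturer.

```python
def relationship_similarity(rule_score, attr_sim, ents1, ents2,
                            entity_sim, cardinality_sim):
    """ents1 = (E11, E12) of R1, ents2 = (E21, E22) of R2.

    rule_score: weighted sum of the relationship's own heuristics, in [0, 1].
    attr_sim:   similarity of the relationships' attribute sets, in [0, 1].
    """
    def context(pairing):
        # Entity similarity plus cardinality similarity, 1/8 per term (assumed).
        return sum(entity_sim(a, b) + cardinality_sim(a, b)
                   for a, b in pairing) / 8.0

    straight = [(ents1[0], ents2[0]), (ents1[1], ents2[1])]
    crossed = [(ents1[0], ents2[1]), (ents1[1], ents2[0])]
    return (0.25 * rule_score + 0.25 * attr_sim
            + max(context(straight), context(crossed)))


def entity_sim(e1, e2):
    """Toy entity similarity for the example below."""
    if e1 == e2:
        return 1.0
    return 0.5 if {e1, e2} == {"Professor", "Lecturer"} else 0.0


def cardinality_sim(e1, e2):
    """Toy stand-in: the associated cardinalities always agree."""
    return 1.0


score = relationship_similarity(1.0, 0.5, ("Student", "Professor"),
                                ("Lecturer", "Student"),
                                entity_sim, cardinality_sim)
print(score)
```

In this example the crossed pairing (Student with Student, Professor with Lecturer) scores higher than the straight one and is therefore the one that enters the result.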



4.4 Comparing more complex parts of databases 

Different structural descriptions of two databases can have the same semantics. 
Information that is designed as an entity in one database can be defined as 
a relationship in another database. To find such cases, we compare entities with 
relationships. For this purpose, information about names, keys, and associated 
attributes can be used and weighted. Information about the similarity between 
an entity and a relationship is further used in the approach. 

The same concepts could be represented in two databases with different 
granularity. For example, it may be that information represented by one entity in 
D1 is represented by two entities and one relationship in D2. We could find 
these similarities by evaluating the attributes and the names. The advantage of 
including such complex comparisons is that many more types of similarities can 
be found. However, it is more difficult to reuse information from such similar 
database parts because the adaptation of design decisions is more complex. 
Furthermore, the search space increases. This method is mentioned here because it 
may be interesting for some applications, but it is not used in our approach. 

4.5 Building a graph 

A typical problem occurring in the reuse of information is that one entity or 
relationship of a database may have similarities with several terms of another 
database. Consequently, we must determine which similar concept to choose for 
deriving further design suggestions. We apply a method known in graph theory 
as graph matching and use a bipartite weighted graph for representing the esti- 
mated similarities. 

Definition ([12]): A bipartite weighted graph G consists of a non-empty 
vertex set V(G) that can be divided into two disjoint subsets, S ∪ T, and a set of 
edges E(G) ⊆ {(s, t) | s ∈ S, t ∈ T}. All edges in E(G) connect one vertex from 
S with one vertex from T. Every edge in the graph has a related weight. 

We can build a bipartite graph representing the estimated similarities as follows: 

1. We begin with an empty graph. 

2. For all determined similar entities and relationships (V1 ∈ D1, V2 ∈ D2), 
vertices are introduced (V1 in S and V2 in T). We draw an edge between 
these vertices. The weight of the edge is the determined similarity measure. 

Next, we try to find a matching (a one-to-one relation) of the similar nodes 
with a maximal sum of all weights. Within a database context, this means that 
we search for similar parts of the databases (D1' ⊆ D1, D2' ⊆ D2). 

We demonstrate the suggested approach using an ongoing example consisting 
of two different databases designed for a university application (figure 1). 

Figure 2 shows the bipartite graph that results from applying the suggested 
method to the two sample databases. We then determine a matching of this 
graph by constructing a cover, c, so that c ≥ w(M). This means that the cover is 
greater than or equal to the weight of the matching. We then search for the case 
where the cover is equal to the weight of the matching. If this cover is 
constructed, we have determined a maximal matching ([12]). Figure 2 presents the 
maximal matching for the similarity graph. 

For the sample databases, the weight of the matching is w(M) = 2.29. This 
weight is subsequently used to choose the database most similar to an actual one. 
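
To make the matching step concrete, the sketch below finds a matching of maximal total weight in a small bipartite similarity graph. It uses exhaustive search rather than the cover-based construction described above, which is adequate only for small graphs. The edge weights are invented for illustration and are not those of figure 2.

```python
from itertools import permutations


def best_matching(weights):
    """weights: dict mapping (s, t) to a similarity; returns (pairs, total weight).

    Exhaustive search over one-to-one assignments; vertices may stay unmatched.
    """
    left = sorted({s for s, _ in weights})
    right = sorted({t for _, t in weights})
    # Padding with None lets a left vertex remain without a partner.
    padded = right + [None] * len(left)
    best_pairs, best_weight = [], 0.0
    for choice in set(permutations(padded, len(left))):
        pairs = [(s, t) for s, t in zip(left, choice) if (s, t) in weights]
        total = sum(weights[p] for p in pairs)
        if total > best_weight:
            best_pairs, best_weight = pairs, total
    return best_pairs, best_weight


# Invented similarity edges between concepts of two databases.
example = {("Student", "Student"): 0.9, ("Student", "Person"): 0.6,
           ("Professor", "Person"): 0.7, ("attends", "studies"): 0.5}
matching, total_weight = best_matching(example)
print(sorted(matching), round(total_weight, 2))
```

The total weight returned plays the role of w(M): among several candidate databases, the one with the highest value would be chosen.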
[Figure 1 diagrams not reproduced. Surviving labels: since, identifier, address, University 2.]

Fig. 1. Sample databases 



The resulting similar and dissimilar parts (D1 - D1', D2 - D2') of 
the databases are shown in figure 3. 

5 Reuse 

We now have an actual database, and we have determined a similar database 
or similar part of a database. This section demonstrates which information can 
be reused for the actual database, and how this can be accomplished. Reuse of 
information from available databases can support the design, or redesign process, 
of a database. 

5.1 Structural design 

First, we demonstrate the kinds of structural completions and expansions that 
can be derived. 
Fig. 2. Maximal matching of the similarity graph 



Addition of attributes. If there are similar entities or relationships, E1 in D1' and 
E2 in D2', and if E1 contains attributes that are not in E2, then we suggest 
adding these attributes into E2. 

Addition of path information. If there are two similar entities and relationships, 
E1, R1 in D1' and E2, R2 in D2', and if these are connected by a path, then we 
suggest adding this path information into D2. 

Addition of relationships. If there exists a relationship R1 in D1 - D1', and all 
associated entities of R1 are in D1' (i.e. similar entities exist in the actual 
database, and the relationship doesn’t exist in the actual database), then the 
relationship R1 can be added in D2. 

Figure 3 illustrates such a case. The entities Student and Professor of 
University1 have similar entities in University2, but the relationship supervise 
does not. Therefore, we suggest adding the relationship supervise as an extension 
of the University2 database. 
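
The rule for adding relationships can be sketched as follows. The data structures (a dictionary of relationships with their associated entities and a mapping between similar entities) are assumptions of this sketch, as are the relationship attends and the entity Lecture; supervise, Student, and Professor follow the example above.

```python
def suggest_relationships(relationships_d1, matched_relationships, entity_match):
    """Suggest relationships of D1 that could be added to D2.

    relationships_d1:      dict mapping a relationship to its entities in D1.
    matched_relationships: relationships of D1 with a similar one in D2.
    entity_match:          dict mapping a D1 entity to its similar D2 entity.
    """
    suggestions = {}
    for name, entities in relationships_d1.items():
        if name in matched_relationships:
            continue  # already has a counterpart in D2
        if all(entity in entity_match for entity in entities):
            suggestions[name] = tuple(entity_match[e] for e in entities)
    return suggestions


relationships_d1 = {"supervise": ("Professor", "Student"),
                    "attends": ("Student", "Lecture")}
entity_match = {"Student": "Student", "Professor": "Professor"}
print(suggest_relationships(relationships_d1, set(), entity_match))
```

Only supervise is suggested, because attends involves an entity without a similar counterpart in the second database.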

Addition of complex database parts. We suggest adding complex parts of the 
database D1 into D2, if similar entities E1 in D1' and E2 in D2' exist, and if E1 
has a direct link to nodes in D1 - D1'. 

For example, parts exist in the University2 database that are not in the 
University1 database (figure 3). Therefore, we suggest adding the concepts City, 
studies, and lives to extend the entity Student. In this way, we derive meaningful 
suggestions for an inside-out design. 



5.2 Integrity constraints 

Integrity constraints can also be derived from available databases by looking at 
the following points. 

Functional Dependencies. If we determine similar attribute sets A11..A1n in D1', 
A21..A2n in D2', and if a functional dependency A1i -> A1j, i,j ∈ 1..n is valid, 
then the corresponding functional dependency in D2' could also be valid. 
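
As an illustration, candidate functional dependencies can be carried over through the attribute correspondence. The representation of a dependency as a pair (left-hand side, right-hand side) and the attribute names are assumptions; a left-hand side with several attributes is allowed as a slight generalisation.

```python
def transfer_functional_dependencies(fds_d1, attribute_match):
    """Translate dependencies of D1 into candidate dependencies for D2.

    fds_d1:          iterable of (lhs, rhs), lhs being a tuple of attribute names.
    attribute_match: dict mapping a D1 attribute to its similar D2 attribute.
    """
    candidates = []
    for lhs, rhs in fds_d1:
        involved = list(lhs) + [rhs]
        if all(attribute in attribute_match for attribute in involved):
            candidates.append((tuple(attribute_match[a] for a in lhs),
                               attribute_match[rhs]))
    return candidates


fds = [(("student_id",), "name"), (("zip",), "city")]
match = {"student_id": "matriculation_no", "name": "student_name"}
print(transfer_functional_dependencies(fds, match))
```

The results are candidates only; they still have to be confirmed, for example against sample data or by the user.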
Fig. 3. Similar and dissimilar portions of the sample databases 



Keys. If similar entities E1 in D1', E2 in D2', and similar attributes A11..A1n 
and A21..A2n are derived, and if the attributes A11..A1n are a key of E1, then 
A21..A2n could also be a key of E2. 

Candidate keys for relationships are determined in the same manner. 

Inclusion and Exclusion Dependencies. If we find two databases with similar 
entities or relationships and similar associated attribute sets, and an inclusion or 
exclusion dependency is defined on the attributes of D1, then the corresponding 
dependency may also be valid in D2. 

Cardinality Constraints. If two databases with similar entities and relationships 
E1, R1 in D1' and E2, R2 in D2' exist, and the cardinality constraint card(R1, E1) 
is fulfilled in D1', we can also expect card(R2, E2) in D2 to have the same value. 
5.3 Behavioral Information, Sample Data, Optimization, and Sample Transactions 

There are several additional characteristics that can be used in the same manner. 

If behavioral information is formally specified in available databases, then 
we can reuse this specification in similar databases. Furthermore, sample data, 
suggestions for optimization, and sample transactions can be reused. 



6 Revise 

Reusable characteristics only provide suggestions and must be revised in some 
way. 

Some design decisions can be checked without the user. For example, 
suggested integrity constraints can be checked to determine that they do not 
conflict, and that they conform to the sample data. 
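
One such automatic check, testing whether a suggested functional dependency conforms to the sample data, can be sketched as follows. The row format (one dictionary per tuple) and the sample values are assumptions.

```python
def holds_on_sample(lhs, rhs, rows):
    """True if no two rows agree on the lhs attributes but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[attribute] for attribute in lhs)
        # setdefault stores the first rhs value seen for this key.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


rows = [
    {"matriculation_no": 1, "student_name": "Ann", "city": "Rostock"},
    {"matriculation_no": 2, "student_name": "Ben", "city": "Rostock"},
    {"matriculation_no": 1, "student_name": "Ann", "city": "Berlin"},
]
print(holds_on_sample(("matriculation_no",), "student_name", rows))
print(holds_on_sample(("matriculation_no",), "city", rows))
```

A dependency that holds on the sample is not thereby proven; one that fails can be discarded without asking the user.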

Other suggestions have to be discussed with the user. If the reuse approach 
is used to derive suggestions for design tools, and a user is required to confirm 
the decisions, this requirement is fulfilled. 

For integrity constraints, a method was demonstrated in [7] that acquires 
candidates for integrity constraints by discussing sample databases. 



7 Retain 

We have shown that the tasks retrieve, reuse, and revise can suggest design 
decisions for a database. To apply the method, we compared complete databases. 
We now demonstrate two methods for organizing the databases into libraries to 
support the reuse process more efficiently. 



7.1 Determining necessary and optional parts 

In [9], the similar parts of databases are stored for further use. In our approach, 
we store the parts occurring in every database for a field of application and 
the distinguishing features of the databases. Several points must be discussed 
with the user, who must decide which one of the synonym names of two similar 
databases is the most suitable. Further, the user has to confirm which attributes 
of an entity or relationship are relevant for a common case. 

In any event, the derived parts must be extended so that all databases are 
complete. This means if a relationship exists in the database, and not all as- 
sociated entities are in this database, then the entities have to be added. The 
positive side-effect of this action is that the database parts could be integrated 
based on these entities that now occur in the similar and dissimilar portions. 

If users want to design a database for one field of application, then they can 
be supported as follows:

1. The part occurring in every database is discussed. If it is relevant to
the actual database, then we continue.

2. The distinguishing features of the existing databases are discussed.
If they are relevant, then they are integrated into the actual database.
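The split into a necessary part and optional parts can be sketched with plain set operations. This is only an illustration under simplifying assumptions: each database is reduced to a set of concept names, synonyms are assumed to have been resolved with the user beforehand, and the library contents (`uni_a`, `uni_b`, `uni_c`) are invented sample data.

```python
from typing import Dict, Set, Tuple


def split_common_and_distinguishing(
    databases: Dict[str, Set[str]],
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return the concepts occurring in every database and, per database,
    the concepts that distinguish it from the common part."""
    if not databases:
        return set(), {}
    # Necessary part: concepts present in all databases of the application field.
    common = set.intersection(*databases.values())
    # Optional parts: whatever each database adds on top of the common part.
    distinguishing = {name: parts - common for name, parts in databases.items()}
    return common, distinguishing


# Hypothetical library for one field of application (names are made up).
library = {
    "uni_a": {"Student", "Course", "enrolls", "Lecturer"},
    "uni_b": {"Student", "Course", "enrolls", "Room"},
    "uni_c": {"Student", "Course", "enrolls", "Lecturer", "Exam"},
}
common, distinguishing = split_common_and_distinguishing(library)
print(sorted(common))
print({name: sorted(parts) for name, parts in distinguishing.items()})
```

In a design session, `common` would be discussed first (step 1) and the entries of `distinguishing` afterwards (step 2).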

7.2 Deriving database modules for the reuse process 

A second method for developing libraries is dividing a database into modules. 
The question then arises of how to locate these modules. For this purpose, the 
following available information is exploited:

- path information 

- integrity constraints (e.g. inclusion dependencies, foreign keys) 

- the course of design process (e.g. which concepts are added together) 

- similar names 

- available transactions 

- layout (if not automatically determined) 

All these characteristics determine how closely entities and relationships be- 
long together. A combination of these heuristic rules can be used to determine 
clusters in a database. These clusters are stored as units of the database, and 
the reuse process is based on these units. 
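One possible way to combine such heuristics is sketched below. It is an assumption-laden illustration, not the procedure used by the authors: each heuristic is modeled as a function returning a score between 0 and 1 for a pair of concepts, the weights and the threshold are arbitrary, and pairs whose weighted score reaches the threshold are merged into one cluster with a union-find structure. The concept names and the two sample heuristics are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence

Heuristic = Callable[[str, str], float]


def cluster(concepts: Sequence[str], heuristics: Sequence[Heuristic],
            weights: Sequence[float], threshold: float) -> List[List[str]]:
    """Group concepts whose weighted heuristic score reaches the threshold."""
    parent: Dict[str, str] = {c: c for c in concepts}

    def find(c: str) -> str:
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in combinations(concepts, 2):
        score = sum(w * h(a, b) for h, w in zip(heuristics, weights))
        if score >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for c in concepts:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical heuristics: a known foreign-key link and a shared name prefix.
FOREIGN_KEYS = {frozenset({"Order", "OrderItem"}), frozenset({"Order", "Customer"})}


def linked_by_constraint(a: str, b: str) -> float:
    return 1.0 if frozenset({a, b}) in FOREIGN_KEYS else 0.0


def similar_name(a: str, b: str) -> float:
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0


modules = cluster(["Order", "OrderItem", "Customer", "Supplier"],
                  [linked_by_constraint, similar_name], [0.6, 0.4], 0.5)
print(modules)
```

Further heuristics from the list above (path information, design history, transactions, layout) would enter as additional scoring functions with their own weights.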



8 Conclusion 

The method presented in this article relies on heuristics and an intuitive way of 
weighting these heuristics. Therefore, we cannot guarantee that it always delivers
correct results. However, the method is simple, easy to apply, and promising for
many design decisions. One of its advantages is that many different database 
design tools can use the same method. 

This method was developed as a part of a tool for acquiring integrity
constraints [7]. The tool’s main approach was to conduct a discussion of sample
databases and to derive integrity constraints from the users’ answers. In that
tool, the approach presented in this article was one method for deriving probable
candidates for keys, functional dependencies, analogue attributes, and inclusion
and exclusion dependencies. The semantic acquisition tool was developed within
the context of the project RADD.^

Another practical application of the method will be embedded in the GETESS^
project, which focuses on the development of an internet search engine. In that
project, we design databases for storing the gathered information by using
ontologies. Because ontologies are domain-dependent and every domain needs its
own ontology, existing ontologies and the corresponding databases shall be
reused. The ontologies resemble conceptual databases; therefore, we can adapt
the demonstrated method for use in this project.

^ RADD - Rapid Application and Database Development Workbench [1],
was supported by DFG Th465/2
^ GETESS - GErman Text Exploitation and Search System [8]
supported by BMBF 01/IN/B02
Abstract. We propose a method for modeling complex Web sources
that have active user interaction requirements. Here “active” refers to the
fact that certain information in these sources is only reachable through
interactions like filling out forms or clicking on image maps. Typically,
the former interaction can be automated by wrapper software (e.g., using
parameterized urls or post commands) while the latter cannot and
thus requires explicit user interaction. We propose a modeling technique
for such interactive Web sources and the information they export, based
on so-called interaction diagrams. The nodes of an interaction diagram
model sources and their exported information, whereas edges model
transitions and their interactions. The paths of a diagram correspond to
sequences of interactions and allow one to derive the various query capabilities
of the source. Based on these, one can determine which queries are
supported by a source and derive query plans with minimal user interaction.
This technique can be used offline to support design and implementation
of wrappers, or at runtime when the mediator generates query plans
against such sources.



1 Introduction 

The mediator framework [Wie92] has become a standard architecture for
information integration systems. Whenever the integration involves large or
frequently changing data, such as Web sources,^ a virtual integration approach
(as opposed to a warehousing approach) is advantageous [FLM98]: at runtime,
the user query against a mediated view is decomposed at the mediator and
corresponding subqueries are sent to the source wrappers. The answers from the
sources are collected back at the mediator which, after some post-processing,
returns the integrated results to the user. When implementing such a framework,
one has to deal with the different capabilities of mediators and wrappers:
Mediators are the main query engines of the architecture and thus are usually
capable of answering arbitrary queries in the view definition language at hand.
In contrast, wrappers often provide only limited query capabilities, due to the
inherent query restrictions induced by the sources. For example, when a
book-shopping mediator generates query plans, it has to incorporate the limited
query capabilities

^ Since we are focusing on Web sources, we often use the terms source and Web 
page/site interchangeably. 
Fig. 1. An interactive Web source (ATM locator)



of wrappers, say for amazon.com or barnesandnoble.com, in order to send only
feasible queries that these wrappers can process. As shown in [GMLY99], query
processing on Web sources can greatly benefit from “capability-aware”
mediators.

In order to model source capabilities, various mechanisms like query tem- 
plates, capability records, and capability description grammars have been de- 
vised and incorporated into mediator systems. An implicit assumption of cur- 
rent mediator systems is that once the sources have exported their capabilities 
to the mediator, all user queries can be processed in a completely automated 
way. In particular, there is no interaction necessary between the user and the 
source while processing a query. Specifically for Web sources, there is a related 
assumption that the Web page which holds the desired source data is reachable, 
e.g., by traversing a link, composing a url, or filling in a form. 

While it is clearly desirable to relieve the user from interacting with the 
source directly, it is our observation that for certain sources and queries, these 
assumptions break down and explicit user interaction is inevitable in order to 
process these queries. 

Example 1 (ATM Locator). Assume we want to wrap an ATM locator service
like the one of visa.com into a mediator system. There are different means
to locate ATMs in the vicinity of a location: Starting from the entry page, the
user may either select a region (say North America) from a menu or click on 
the corresponding area on a world map (Fig. 1, left). At the next stage the user 
fills in a form with the address of the desired location. The source returns a 
map with the nearby ATM locations and a table with their addresses (Fig. 1, 
right). Thus, for given attributes like region, street, and city, one can implement 
a wrapper which automatically retrieves the corresponding ATM addresses. 

However, if the task is to select one or more specific ATMs based on properties 
visible from the map only (e.g., select ATMs which are close to a hotel), or if the 
selection criterion is imprecise and user-dependent (“I know what I want when
I see it”), then an explicit user interaction is required and the source interface 
has to be exposed directly to the user. □ 

In this paper we explore the problem of bringing such sources, i.e., sources
which may require explicit user interactions, into the scope of mediated
information integration. To simplify the presentation, we use a relational
attribute-specification model for modeling query capabilities. The outline and
contributions of the paper are as follows:

• In Section 2 we show how the capabilities and interactions of Web sources
can be modeled in a concise and intuitive way using interaction diagrams. To
the best of our knowledge, this is the first approach which allows modeling
complex Web sources with explicit user interactions and brings them into
the realm of the information mediation framework.

• In Section 3 we show how to derive the combined capabilities of the modeled 
sources given the capabilities of individual interactions. This allows us to 
distinguish between automatable (wrappable) queries and queries which can 
only be supported with explicit user interaction. Based on this, a “diagram- 
enabled” wrapper can suggest alternative queries to the mediator, in case 
the requested ones are not directly supported. 

• In Section 4 we sketch how interaction diagrams can be embedded into the 
standard mediator architecture and discuss the benefits of such a “diagram- 
enabled” system. We give some conclusions and future directions in Section 5. 



Related Work. Incorporating capabilities into query evaluation has recently 
gained a lot of interest, and several formalisms have been proposed for
describing and employing query capabilities [PGGMU95,LRO96,PGH98,YLGMU99].
Especially, in the context of Web sources, query processing can benefit from a 
capability-sensitive architecture as noted in [GMLY99]. Their work focuses on 
efficient generation of feasible query plans at the mediator, i.e., plans which the 
source wrappers can support. Target queries are of the form
$\pi_{\mathbf{b}}(\sigma_{\psi(\mathbf{a})}(R))$ where $\psi(\mathbf{a})$ is a selection condition over the
input attributes $\mathbf{a}$, and $\mathbf{b}$ are the output attributes. Their capability
description language SSDL deals with the different forms of selection
conditions $\psi(\mathbf{a})$ that a source may support. In contrast, we address the
problem of “chaining together” queries like $\pi_{\mathbf{b}}(\sigma_{\psi(\mathbf{a})}(R))$, which allows
us to model complex sequences of interactions.
Several formalisms for modeling Web sites have been proposed which more or
less resemble interaction diagrams: For example, the Web skeleton described in
[LHL+98] is a very simple special case of diagrams where the only interaction
elements are hyperlinks. Other much more versatile models have been proposed like
the Web schemes of the Araneus system [AMM97,MAM+98], which use a
page-oriented ODMG-like data model, extended with Web-specific features like forms.
Another related formalism is the navigation maps described in [DFKR99]. The
problem of deriving the capabilities and interaction requirements of sequences
of interactions (i.e., paths in the diagram) is related to the problem of
propagating binding patterns through views as described in [YLGMU99]. There, an
in-depth treatment of how to propagate various kinds of binding patterns (or
adornments) through the different relational operators is provided.

However, the approaches mentioned above do not address the problem of 
modeling sources with explicit user interactions and how to incorporate such 
sources into a mediator architecture. 



2 Modeling Interactive Sources 

As illustrated above, there are user interactions with Web sources which can be 
wrapped (e.g., link traversal, selections from menus, filling of forms), and others 
which require explicit user interaction (e.g., selections from image maps, GUIs
based on Java, VRML, etc.). Before we present our main formalism for modeling 
such sources, interaction diagrams, let us consider how the typical interaction 
mechanisms found in Web sources relate to database queries. 



Modeling Input Elements 

Below, for an attribute $a$, we denote by $\$a$ the actual parameter value for $a$ as
supplied by the corresponding input element. In first-order logic parlance, we
may think of $a$ as a variable which is bound to the value $\$a$. Hence, we often
use the terms attribute and variable interchangeably. We denote vectors and
sequences in boldface. For example, $\mathbf{a}$ stands for some attributes $a_1, \ldots, a_n$.

We can associate the following general query scheme with the input elements 
discussed below. 

$$\pi_{\mathbf{b}}(\sigma_{\psi(\mathbf{a})}(R))$$

Here, $\mathbf{a}$ are the input parameters to the source, $\mathbf{b}$ are the desired output
parameters to be extracted from the source, and $\mathbf{a}, \mathbf{b} \in atts(R)$, i.e., the
relation $R$ being modeled has attributes $\mathbf{a}, \mathbf{b}$. The first-order predicate $\psi$
over $\mathbf{a}$ is used to select the desired tuples from $R$. With every input element
we will associate a default semantics, according to its standard use.



Hyperlinks are the “classical” way to provide user input. We can view the 
traversal of a link href (a) as providing a value $a for the single input attribute 
a. The value $a is given by the label of the link. For example, by clicking on 




Modeling Interactive Web Sources 229 



a specific airport code, we can bind $apc$ (airport code) to the label of the link.
Assume we want to extract the zip code from the resulting page. We can model
this according to the above scheme as $\pi_{zip}(\sigma_{apc=\$apc}(R))$. Since equality is the
most common selection condition for links, we define it as the default semantics
for $href(a)$:

$$\psi(a) := (a = \$a).$$

Other meanings are possible. For example, a Web source may contain a list of
maximal prices for items such that by clicking on a maximal value $\$price$, only
items for which $price < \$price$ are returned. Thus, $\psi(price) := (price < \$price)$.
Note that the domain of $a$ is finite, since the source page can have only finitely
many (static) links.



Forms can be conceived as dynamic links since the target of “traversing” such
a link depends on the form’s parameters. In contrast to hyperlinks, forms
generally involve multiple input attributes, ranging over an infinite domain (think of
a form-based tax calculator). For example, given a street and a city name, the
extraction of zip codes from the result page can be modeled by the input
element $form(street, city)$ and the query
$\pi_{zip}(\sigma_{street=\$street \wedge city=\$city}(R))$. Similar to href, the most common use
and thus our default semantics is to view forms as conjunctions of equalities:

$$\psi(\mathbf{a}) := (a_1 = \$a_1 \wedge \ldots \wedge a_n = \$a_n).$$

In contrast, by setting $\psi(low, high) := (\$low < price < \$high)$ we can model
a range query which retrieves tuples from $R$ whose price is within the specified
interval. Optional parameters can be modeled as well: let
$\psi(a, b) := ((a = \$a \wedge b = \$b) \vee a = \$a)$. This models the situation that $b$ is
optional: if $b$ is undefined (set to null), the first disjunct will be undefined
while the second can still be true.



Menus are used to select a subset of values from a predefined, finite
domain. For example, the selection of one or more US states from a menu is
denoted by $menu(state)$. When modeling a menu attribute, it can be useful
to include the attribute’s domain. For example, we may set
$dom(state) := \{ALABAMA, \ldots, WYOMING\}$. In contrast to $href(a)$ and
$form(\mathbf{a})$, multiple values $\$a_1, \ldots, \$a_n$ can be specified for the single
attribute $a$. The default semantics for menus is

$$\psi(a) := (a = \$a_1 \vee \ldots \vee a = \$a_n).$$
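The three default semantics can be made concrete as predicate builders over a small in-memory relation. The following is a minimal sketch, not a wrapper implementation: rows are modeled as Python dictionaries, the helper names `href`, `form`, `menu`, and `query` are chosen here for illustration, and the sample relation `R` (including its zip codes) is invented data.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, str]
Predicate = Callable[[Row], bool]


def href(attr: str, value: str) -> Predicate:
    """Default link semantics: a = $a."""
    return lambda row: row[attr] == value


def form(**bindings: str) -> Predicate:
    """Default form semantics: a1 = $a1 AND ... AND an = $an."""
    return lambda row: all(row[a] == v for a, v in bindings.items())


def menu(attr: str, *values: str) -> Predicate:
    """Default menu semantics: a = $a1 OR ... OR a = $an."""
    return lambda row: row[attr] in values


def query(relation: List[Row], predicate: Predicate,
          outputs: Sequence[str]) -> List[Tuple[str, ...]]:
    """Evaluate the general scheme: project the outputs b over the selected rows."""
    return [tuple(row[b] for b in outputs) for row in relation if predicate(row)]


# Invented sample relation R with attributes street, city, state, zip.
R = [
    {"street": "Main St", "city": "Springfield", "state": "ILLINOIS", "zip": "62701"},
    {"street": "Main St", "city": "Madison", "state": "WISCONSIN", "zip": "53703"},
    {"street": "Oak Ave", "city": "Springfield", "state": "ILLINOIS", "zip": "62704"},
]

by_form = query(R, form(street="Main St", city="Springfield"), ["zip"])
by_menu = query(R, menu("state", "ILLINOIS", "WISCONSIN"), ["zip"])
by_link = query(R, href("city", "Madison"), ["zip"])
print(by_form, by_menu, by_link)
```

Non-default semantics such as the range query on `price` would simply be other predicate builders passed to the same `query` function.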

Non-Wrappable Elements. Conceptually, the elements considered above are 
all wrappable in the sense that the user interaction can be automated by
wrappers, which have to fill in forms (via http’s post), traverse links (get), possibly
after constructing a parameterized url^, etc. We call an interaction non-wrappable

^ E.g., search.yahoo.com/bin/search?p=a+OR+b finds documents containing a or b.
Other elements like radio buttons and check boxes can be modeled similarly. 
if, using a reasonable conceptual model of the source, an explicit user interaction
is inevitable to perform that interaction (cf. Section 3). Note that technically, 
selections on clickable image maps could also be wrapped, say using a url which 
is parameterized with the xy-coordinates of the clicked area. However, this is 
usually not an adequate conceptual model of the query, since the xy-coordinates 
do not provide any hint on the semantics of the query being modeled. Thus, for 
sources where automated wrapping is impractical or inadequate from a modeling 
perspective, we “bite the bullet” and ask the user for explicit interaction with 
the source. To model such a user interaction, we use the notation 

$$!ui(a_1, \ldots, a_n)$$

where $a_1, \ldots, a_n$ are input parameters on which the user interaction depends.
Consider, e.g., an image map on which the user clicks to change the focus or pan
the image. We can regard this as an operation with some input parameter $loc$,
representing the current location, and some otherwise unspecified, implicit input
parameter $x$ which models the performed user interaction.^ In this case, we can
model the interaction as $\pi_{loc'}(\sigma_{loc=\$loc,\, x=\$x}(R))$ where $R$ has input attributes
$loc$ and $x$ and an output attribute $loc'$.

Modeling Complex Sources with Interaction Diagrams 

To describe the query capabilities of and dependencies within complex interactive 
sources, we propose a formalism in the spirit of state-transition diagrams, called 
interaction diagrams, or diagrams for short. Conceptually, a node of a diagram 
is viewed as an individual source. A source can be a single Web page or comprise 
several pages which exhibit the same input/output behavior. With every node 
we associate an export schema, representing the information provided by that 
source. The possible transitions between (the states of) sources are modeled by 
labeled edges. The edge label specifies the type of interaction (form, menu, !ui, 
etc.) which is required to perform the transition and determines the capabilities 
of the source wrt. this transition. 

Diagrams. More precisely, an interaction diagram $d$ for a source is defined
over a set of attributes $atts$ ($= atts(d)$) and consists of labeled nodes and labeled
directed edges. Attributes are used to describe the modeled entities of the source
and thus are the basic building blocks of the source description. In the sequel,
let $a, a_1, a_2, \ldots \in atts(d)$.



Sources. The nodes of $d$ are called sources and model identifiable units within
the complex source, most notably Web pages. Sources have an identifier (node
id), typically a url, and an output schema of exported attributes $a_1, \ldots, a_n$. These
attributes model all relevant information exported by the source.

^ E.g., $x$ could be an encoding of xy-coordinates. However, we do not provide or need
a way to dissect $x$ since it acts merely as an identifier.
We distinguish different ways in which attributes can be exported in an
output schema:

• $(a_1, \ldots, a_n)$: the source exports these attributes tuple-at-a-time,

• $\{(a_1, \ldots, a_n)\}$: the source exports a set of such tuples,

• $[(a_1, \ldots, a_n)]$: the source exports an ordered list of such tuples.

This structural information can be used to derive additional query properties of 
sources like (non-)uniqueness of results and availability of order, thereby sup- 
porting the wrapper design. 



Internal Attributes. Sometimes, attribute values are not explicitly provided 
by the source (e.g., the location corresponding to the center of the image in Fig. 1, 
or the attribute loc in Example 2). Although such a source may be “stateful” and 
remember the current value of loc, this value is implicit and cannot be directly 
extracted by a wrapper. Nevertheless, the output of subsequent interactions may 
very well depend on the value of the latent loc attribute, so we cannot ignore it. 
We say that an attribute a is internal, denoted by &a, if the actual value of a is 
not directly extractable. 



Transitions. Labeled edges are used to model possible transitions between
(states of) sources. Each transition $t$ is labeled with one or more interactions
$\mathbf{i} = i_1, \ldots, i_n$, where each interaction $i_j \in \{href, form, menu, !ui\}$ has input
parameters from $atts(d)$. Transitions can be further constrained by attaching
conditions as follows. The transition

$$t : x \xrightarrow{\ \mathbf{i}\,[\varphi,\ \mathbf{u} \setminus \mathbf{v}]\ } y$$

means that one can move from source $x$ to $y$ using the interaction(s) $\mathbf{i}$ only if
the first-order condition $\varphi$ holds. This allows modeling of additional semantic
constraints which are enforced by a source (e.g., a source having $form(low, high)$
may enforce $\varphi := \$low < \$high$). Note that by default, the input parameters of
$t$ are the attributes occurring in $x$ or $\mathbf{i}$. In order to indicate that the outcome of
$t$ also depends on the attributes $\mathbf{u}$ but not on $\mathbf{v}$, the expression $\mathbf{u} \setminus \mathbf{v}$ is used.

Example 2 (Going to the Bank). The diagram in Fig. 2 defines the following
model of a complex source: The source at the start node $n_0$ does not export
information relevant to the application. By filling in $form_1$ with a street and
city name and selecting a state from $menu_1$, we can move to source $n_1$, which
provides an internal attribute &loc for the requested location ($t_1$). As long as
we apply user interactions $!ui_1$, we remain at source $n_1$, possibly updating its
internal location attribute ($t_2$). We can move to $n_2$ by filling in another form
$form_2$ with a positive radius, or by executing some user interaction $!ui_2$ (say
marking a rectangular region). Source $n_2$ exports a list of bank identifiers (say
as hyperlinks), and a set of internal bank locations (e.g., as a clickable image
map). There are two ways to move to $n_3$: We may either traverse a link which
Fig. 2. A diagram for an interactive source (for retrieving addresses of nearby banks)

is labeled with the bank id ($t_5$), or we use another user interaction $!ui_3$ (again
by clicking on an image map). Note that we can ignore the &bankloc attribute
in $t_5$ (as specified in brackets), since the bankid is sufficient to execute $t_5$. □

3 Deriving Source Capabilities 

While interaction diagrams are a useful modeling tool in themselves, their main
pose is to support the automatic derivation of capabilities of complex interaction 
patterns of the modeled source. 

Single Transitions. The capabilities and interaction requirements of a single
transition

$$t : x \xrightarrow{\ \mathbf{i}\,[\varphi,\ \mathbf{u} \setminus \mathbf{v}]\ } y$$

are modeled as follows:

• $in(t) := (atts(x) \cup atts(\mathbf{i}) \cup \mathbf{u}) \setminus \mathbf{v}$ are the required input attributes,

• $out(t) := atts(y)$ are the output attributes which are exported by $t$, and

• $act(t) := \mathbf{i} \wedge \varphi$ are the interaction (execution) requirements of $t$.

Here, $atts(\ldots)$ is the set of all attributes occurring in the corresponding
expression. For each user interaction $!ui_k$, we also add a new internal attribute
$\&x_k$ representing the user interaction, so

$$atts(!ui_k(a_1, \ldots, a_n)) := \{a_1, \ldots, a_n, \&x_k\}.$$

Transition Paths. Based on the above notions, we can derive the query
capabilities along paths, i.e., sequences of transitions. A path of a diagram $d$ is a
sequence $t_1.t_2.\cdots.t_n$ of connected transitions, i.e., where the target of $t_i$ meets
the source of $t_{i+1}$. Let $\mathbf{t} = t_3.\cdots.t_n$ and $t_1.t_2.\mathbf{t}$ be paths of $d$ (here, $\mathbf{t}$ may be
empty). The capabilities and interaction requirements along paths are
inductively defined as follows:

• $in(t_1.t_2.\mathbf{t}) := in(t_1) \cup (in(t_2.\mathbf{t}) \setminus propagate(\varphi_{con}(t_1, t_2)))$ are the input
attributes,

• $out(t_1.t_2.\mathbf{t}) := out(t_2.\mathbf{t})$ are the output attributes, and

• $act(t_1.t_2.\mathbf{t}) := act(t_1) \otimes \varphi_{con}(t_1, t_2) \otimes act(t_2.\mathbf{t})$ is the execution plan
(interaction requirement).

Here, we have assumed that

• $\varphi_{con}(t_1, t_2)$ is a first-order predicate over $out(t_1) \cup in(t_2)$ which “connects”
the output of $t_1$ with the input of $t_2$. By default, $\varphi_{con}(t_1, t_2)$ is defined as the
natural join over the common attributes of $out(t_1)$ and $in(t_2)$. This default
can be overridden by attaching an explicit condition at the node connecting
$t_1$ and $t_2$,

• $propagate(\varphi_{con}(t_1, t_2))$ are those attributes of $in(t_2)$ whose bindings are
propagated from $out(t_1)$ using $\varphi_{con}(t_1, t_2)$. Thus, in the default case with
natural joins:

$$propagate(\varphi_{con}(t_1, t_2)) := out(t_1) \cap in(t_2)$$

• “$\otimes$” is a binary, right-associative connective denoting serial conjunction.^
Thus, an execution plan has two readings, both of which have to be obeyed:
(i) as a logic formula where the first-order conditions $\varphi$ and $\varphi_{con}$ of the
plan are viewed as conjunctively connected, and (ii) as a linear sequence of
required interactions $\mathbf{i}$.

The query capabilities along a non-empty path $\mathbf{t} = t_1.\cdots.t_n$ are then denoted by
the binding pattern

$$q_{\mathbf{t}}(in(\mathbf{t}), out(\mathbf{t})).$$

Below, we “sign” an attribute $a$ to indicate whether it is input ($+a$), output
($-a$), or both ($\pm a$).

Example 3 (Banks Revisited). Consider the diagram in Fig. 2. It is easy to
derive the binding patterns

• $q_{t_1}(+street, +city, +state, -\&loc)$, and

• $q_{t_3}(+\&loc, +radius, -bankid, -\&bankloc)$.

We can connect $t_1$ and $t_3$ via $n_1$ by a first-order predicate $\varphi_{con}(t_1, t_3)$ which, by
default, is set to the natural join over the common attributes, so

$$\varphi_{con}(t_1, t_3) = q_{t_1}(+street, +city, +state, -\&loc) \bowtie_{\&loc} q_{t_3}(+\&loc, +radius, -bankid, -\&bankloc),$$

from which we derive the binding pattern

$$q_{t_1.t_3} = (+street, +city, +state, +radius, -bankid, -\&bankloc).$$

^ This notation and terminology is borrowed from Transaction Logic [BK94].
Now consider $t_5$. Since $t_5$’s only input attribute, bankid, is exported by the
target node of $t_3$, we can connect $t_1.t_3$ and $t_5$ via $n_2$ without the need to
supply additional input parameters. From this we obtain

$$q_{t_1.t_3.t_5}(+street, +city, +state, +radius, -bname, -bstreet, -bcity).$$

The execution plan necessary to implement $q_{t_1.t_3.t_5}$ is

$$\begin{array}{rl}
act(t_1.t_3.t_5) = & form_1(street, city),\ menu_1(state) \\
\otimes & \varphi_{con}(t_1, t_3) \\
\otimes & form_2(radius) \wedge radius > 0 \\
\otimes & \varphi_{con}(t_3, t_5) \\
\otimes & href(bankid)
\end{array}$$

where $\varphi_{con}(t_3, t_5)$ is the natural join $q_{t_3}(\ldots) \bowtie_{bankid} q_{t_5}(\ldots)$. □
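The derivation in Example 3 can be replayed mechanically. The sketch below is an illustration of the inductive definitions of $in$ and $out$ under the default natural-join connection only (it does not handle explicit connecting conditions or build the execution plan). The class name `Transition` and the functions `path_in` and `path_out` are names chosen here; the attribute sets are taken from Example 3.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Transition:
    name: str
    inputs: FrozenSet[str]   # in(t)
    outputs: FrozenSet[str]  # out(t)


def path_in(path: List[Transition]) -> FrozenSet[str]:
    """in(t1.t2.t) := in(t1) U (in(t2.t) minus propagate), with the default
    propagate := out(t1) intersected with in(t2)."""
    first, rest = path[0], path[1:]
    if not rest:
        return first.inputs
    propagated = first.outputs & rest[0].inputs
    return first.inputs | (path_in(rest) - propagated)


def path_out(path: List[Transition]) -> FrozenSet[str]:
    """out(t1.t2.t) := out(t2.t), i.e. the outputs of the last transition."""
    return path[-1].outputs


# Single-transition capabilities as stated in Example 3.
t1 = Transition("t1", frozenset({"street", "city", "state"}), frozenset({"&loc"}))
t3 = Transition("t3", frozenset({"&loc", "radius"}), frozenset({"bankid", "&bankloc"}))
t5 = Transition("t5", frozenset({"bankid"}), frozenset({"bname", "bstreet", "bcity"}))

print(sorted(path_in([t1, t3])), sorted(path_out([t1, t3])))
print(sorted(path_in([t1, t3, t5])), sorted(path_out([t1, t3, t5])))
```

The internal attribute &loc drops out of the inputs of the combined path because it is propagated from the output of the first transition, which matches the binding patterns derived above.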

By modeling complex interactive sources using diagrams, we can derive the 
different query capabilities and associated interaction requirements which are 
necessary to implement these queries at the sources. Indeed, from the above 
definitions one can easily derive an algorithm which, given a diagram d and a 
non-empty transition path $\mathbf{t}$ of $d$, computes the unique binding pattern $q_{\mathbf{t}}$. This
is an important prerequisite for enabling capability-sensitive query processing in
a mediator framework (cf. Section 4) and allows us to determine the supported
(or feasible) queries of a source:

We say that a binding pattern $q(in, out)$ subsumes a pattern $q'(in', out')$ if
$in \subseteq in'$ and $out \supseteq out'$ (since we can answer $q'$ by resorting to $q$); $q$ and $q'$ are
called equivalent if they subsume each other.

A query $q(in, out)$ is supported by a source if there is a path $\mathbf{t}$ in the
corresponding diagram such that $q_{\mathbf{t}}$ subsumes $q$. Finally, $q$ is called wrappable (or
automatable) if it is supported by some path $\mathbf{t}$ such that $act(\mathbf{t})$ does not contain
any user interaction $!ui$.
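These three notions translate directly into set comparisons. The following sketch assumes that the binding patterns of all relevant paths have already been derived and are given together with the list of interactions each path requires. The first entry of `paths` is the pattern from Example 3; the second entry is a hypothetical path added only to show a query that is supported but not wrappable.

```python
from typing import FrozenSet, List, Tuple

Pattern = Tuple[FrozenSet[str], FrozenSet[str]]  # (in, out)
PathInfo = Tuple[Pattern, List[str]]             # (pattern, required interactions)


def subsumes(q: Pattern, q_prime: Pattern) -> bool:
    """q subsumes q' if in is a subset of in' and out is a superset of out'."""
    return q[0] <= q_prime[0] and q[1] >= q_prime[1]


def equivalent(q: Pattern, q_prime: Pattern) -> bool:
    return subsumes(q, q_prime) and subsumes(q_prime, q)


def supported(query: Pattern, paths: List[PathInfo]) -> bool:
    return any(subsumes(pattern, query) for pattern, _ in paths)


def wrappable(query: Pattern, paths: List[PathInfo]) -> bool:
    """Supported by some path whose interactions contain no !ui."""
    return any(
        subsumes(pattern, query) and not any(i.startswith("!ui") for i in acts)
        for pattern, acts in paths
    )


paths: List[PathInfo] = [
    # q_{t1.t3.t5} from Example 3.
    ((frozenset({"street", "city", "state", "radius"}),
      frozenset({"bname", "bstreet", "bcity"})),
     ["form1", "menu1", "form2", "href"]),
    # Hypothetical path that needs an explicit user interaction.
    ((frozenset({"street", "city", "state"}), frozenset({"bname"})),
     ["form1", "menu1", "!ui2", "href"]),
]

q_full = (frozenset({"street", "city", "state", "radius"}), frozenset({"bname"}))
q_short = (frozenset({"street", "city"}), frozenset({"bname"}))
q_no_radius = (frozenset({"street", "city", "state"}), frozenset({"bname"}))
print(supported(q_full, paths), wrappable(q_full, paths))
print(supported(q_short, paths))
print(supported(q_no_radius, paths), wrappable(q_no_radius, paths))
```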

Example 4 (Properties of Transition Paths). Assume the mediator sends
a request of the form $q(+street, +city, -bname)$ to the source modeled in Fig. 2.
This query is not feasible since there is no path in the diagram which supports
it. However, a “diagram-enabled” wrapper can determine close matches to this
request and suggest $q_{t_1.t_3.t_5}$ which yields the desired output if the mediator can
come up with a plan that provides the inputs $+state$ and $+radius$.

Another close match is $q_{t_1.t_3.t_6}$ which does not need an input value for radius.
However, this path is not automatable since it involves an explicit user
interaction $!ui_3$. Thus, the first alternative is usually preferable (unless the mediator
cannot provide the additional input parameters). □

Note that when modeling a site with interaction diagrams, the designer of the 
diagram has to ensure that the binding patterns derivable from the diagrams 
correctly reflect what is being computed by the sources, i.e., their semantics. 
For example, a binding pattern $q(+a, +t, -b)$ could mean “find books b where
author=a or title=t”, but it could also mean “... where author=a and title=t”,
or even “... where author=a if title=t”, etc. Dealing with these different possible
semantics is beyond the scope of this paper and we just assume that attribute 
names and transitions between sources have been modeled accordingly. 



4 Using Interactive Sources in the Mediator Framework 

In the previous section we have defined formal properties of interaction diagrams 
and shown how they can be used to model complex sources involving multiple 
pages and sites. In this model, the capabilities of a source, i.e., the supported
queries and interaction requirements, are characterized by the paths in the di- 
agram. In the following, we explore how a wrapper using interaction diagrams 
can participate in the query evaluation process controlled by the mediator. We 
present different scenarios of interplay between the mediator and the wrapper, 
with increasingly complex dependence on the properties of the diagram. 



Schema Only. First consider a scenario where the mediator is unaware of 
the binding patterns derivable from the interaction diagram. The mediator only 
keeps an account of the attribute names from a wrapper and passes a complete 
query to the wrapper. The wrapper uses the graph to determine if the query 
is supported. If the query is not feasible, the wrapper returns with an error, 
and the mediator tries to send a different rewrite of the query. In this case, the 
mediator does not have any knowledge of why the query failed, and hence cannot 
make any intelligent choice for the next rewrite. Although this approach makes 
the mediator-wrapper interactions very simple, this is potentially a very costly 
solution, and does not utilize any benefit of the interaction diagram. 



Summary Table. In this scenario the mediator knows the role each attribute 
can play (input, output, both), but has no knowledge of the combination of 
attribute-role pairs that are permitted by the source. In this case the mediator 
maintains a summary table of the form 

[±a, +b, ±c, -d, +e, . . .] 

which approximates the capabilities of the source. The query $q(+a, +b, -c, -d)$ is
supported according to this table and thus should have a set of equivalent paths
by which the pattern can be satisfied. However, the source may actually
support only the binding patterns $q_1(+a, +b, -c)$ and $q_2(+c, -d)$, and the mediator
has to compute $q_1 \bowtie_c q_2$. To keep track of such situations, [YLGMU99]
precompute and maintain all possible binding patterns for the source at the mediator.
Clearly, all such patterns can be automatically generated from the diagram,
and registered with the mediator when the wrapper is first initiated. However,
for a complex Web source the number of patterns may be too large (in the worst
case $3^n$ for $n$ attributes), and hence the solution does not scale well. We could
reduce this number by eliminating subsumed binding patterns as in [YLGMU99]. In this
case, one has to ensure that wrappable paths are preferred over non-wrappable
ones. If the reduction by subsumption restricts the space of binding patterns 
to a manageable degree, then this is an acceptable solution. Since the mediator 
has complete knowledge of the wrapper’s capabilities, it can be guaranteed to 
produce only feasible plans. 



Active Wrapper. In case the previous solution still creates an unacceptable 
number of binding patterns at the mediator, the wrapper needs to take a more 
active part in query evaluation. Now, since the mediator does not have all the 
binding patterns, the wrapper has to evaluate the query by further decomposing 
it into subqueries. We sketch a method to accomplish this at the wrapper in the 
following way. Given a binding pattern q: 

• The diagram-aware wrapper traverses the interaction diagram to find all 
paths that subsume the query. A path here corresponds to the execution 
plan defined in Section 3. 

• If there is only one such path, the wrapper can immediately execute the 
query, since the mediator has no decisions to make in that query. 

• If however there is more than one path, the wrapper has the choice to either 
send all of them to the mediator, or prune them by some local heuristic to 
reduce the number of viable execution plans. We have found the following 
pruning heuristic to be effective: 

- If there is a single path with no user interaction !ui, choose that path.

- If there are multiple paths without !ui's, rank the paths by path length.

- If all paths have one or more !ui's, rank the paths first by the length
of the path, and then by the number of !ui's. The intuitive idea is first
to reduce the number of pages to visit, and then reduce the number of
forms or menus to fill in.

• Send the ranked list of paths to the mediator. The ranked list acts as a 
qualitative cost estimator for the execution plan.® 
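
The pruning heuristic above can be sketched as follows. The `Path` class and its fields are hypothetical, and the sketch assumes that as soon as at least one path without user interaction exists, paths that do require user interaction are dropped; the text leaves this mixed case open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    pages: tuple            # pages visited along the path, in order
    user_interactions: int  # number of !ui steps on the path

    @property
    def length(self):
        return len(self.pages)


def rank_paths(paths):
    """Rank candidate execution paths with the pruning heuristic."""
    silent = [p for p in paths if p.user_interactions == 0]
    if len(silent) == 1:
        return silent  # a single path without !ui: choose it
    if silent:
        return sorted(silent, key=lambda p: p.length)
    # Every path needs user interaction: length first, then number of !ui's.
    return sorted(paths, key=lambda p: (p.length, p.user_interactions))


candidates = [
    Path(("search", "list", "detail"), 2),
    Path(("search", "detail"), 1),
    Path(("menu", "detail"), 3),
]
ranked = rank_paths(candidates)
```

The ranked list is what the wrapper would send back to the mediator as a qualitative cost estimate.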



Touting Wrapper. Finally, consider a variant of the above scenario where 
the wrapper aids the mediator by providing additional hints (cf. Example 4). 
Again assume that the query is q(+a, +b, -c, -d), but suppose the binding
patterns supported by the source are q1(+a, +b, -c) and q2(+c, +e, -d). The query
will obviously fail, but the mediator does not know that the query would have 
succeeded if attribute e had been provided. 

There are two possible outcomes if the wrapper sends back this hint to the 
mediator. First, the mediator may already have a constraint on e, but this con- 
straint was never passed to the wrapper, because the mediator found a second 
source that also uses e, and hence cleanly separated the attributes between 
sources. In this case, the mediator needs to rewrite the query by reusing the 
attribute e for the first source also. In the second case, the original query is 

® The mediator may also cache the alternatives returned to reduce the number of 
interactions with the wrapper in subsequent queries.

under-specified and the mediator returns to the user with a list of missing
attributes, thus making the query evaluation more cooperative. To implement this
solution we can modify the above method with a parameter k, such that only 
paths with up to k missing attributes will be searched. 
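
A minimal sketch of the parameter k, under the assumption that a path is a sequence of (inputs, outputs) pairs and that an attribute counts as missing when a step needs it as input but neither the query nor an earlier step provides it. The function names are hypothetical.

```python
def missing_attributes(bound, path):
    """Attributes a path needs that the query and earlier steps do not supply."""
    available = set(bound)
    missing = set()
    for inputs, outputs in path:
        missing |= set(inputs) - available
        # Assume a missing input is eventually supplied, so later steps may use it.
        available |= set(inputs) | set(outputs)
    return missing


def paths_with_hints(bound, paths, k):
    """Keep paths with at most k missing attributes, paired with the hint."""
    hints = []
    for path in paths:
        missing = missing_attributes(bound, path)
        if len(missing) <= k:
            hints.append((path, sorted(missing)))
    return hints


# The example from the text: q1(+a, +b, -c) followed by q2(+c, +e, -d).
q1 = ({"a", "b"}, {"c"})
q2 = ({"c", "e"}, {"d"})
hints = paths_with_hints({"a", "b"}, [[q1, q2]], k=1)
```

With k = 1 the wrapper reports that attribute e is missing; with k = 0 the path is not reported at all.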



5 Conclusion and Outlook 

We have proposed a method for modeling the capabilities of complex interactive 
Web sources using interaction diagrams and have shown how the standard input 
elements of Web sources (forms, menus, etc.) can be modeled as restricted
relational queries with a certain input/output pattern. These elements are the basis
for our formal model of interaction diagrams, where they are used to specify the
interaction requirements of transitions between sources (i.e., Web pages having the
same input/output schema). A “diagram-enabled” wrapper can chain together
transitions, thereby supporting more complex queries than those provided by 
the individual subsources. This is possible because the query capabilities and 
interaction requirements (i.e., plans for executing the query) of paths of
transitions can be automatically derived from a diagram.® In particular, this allows
one to examine alternative execution paths and determine those with minimal user
interaction.

The last two scenarios discussed in the previous section demonstrate the
additional value of the diagram-based representation of wrapper capabilities in the
mediator framework. They also show a departure from the more commonly used
thin-wrapper heavy-mediator model to a more negotiation-oriented interopera- 
tion between the two components. Such a negotiation will be useful as we move 
from modeling simple Web sources whose capabilities can be manually modeled 
to more complex, dynamic Web sources, for which creating a static exhaustive 
set of capabilities is impractical. 

We plan to integrate our approach into the MIX^ mediator system, which
employs a virtual integration approach where queries are decomposed at runtime
and sent to the source wrappers. Additionally, query evaluation in the MIX
mediator system is lazy (or on-demand), i.e., driven by the client's navigation into
the virtual answer view [LPV99]. Interestingly, by incorporating the proposed
diagrams into this mediator architecture, user interactions and query evaluation
become mutually dependent: when the user issues a query or navigates into a
view, the mediator decomposes the request and sends subqueries to the sources.
Some of these queries may be infeasible without additional user interaction. In 
these cases, the source “calls back” the user and requests additional input before 
query evaluation can proceed. 

Acknowledgments. The first author thanks Birgitta König-Ries for numerous
and detailed comments on the paper. 



® A first prototype for analysing transition paths has been implemented. 
^ Mediation of Information using XML [MIX99, BGL+99]

Abstract. In this paper we argue that Web applications are a particular
kind of hypermedia application and show how to model their navigational
structure. We argue that if we need to design applications combining
hypermedia navigation with complex transactional behaviors (as
in E-commerce systems), we need a systematic development approach. 
We present the main ideas underlying the Object-Oriented Hypermedia 
Design Method (OOHDM) and show that Web applications are built as 
views of conceptual models. We present the abstraction primitives used 
to design conceptual and navigational structure of Web applications and 
describe the view definition language. We introduce navigational con- 
texts as the structuring mechanism for the navigational space. Further 
work on designing Web applications with OOHDM is also presented. 



1 Introduction: Web Applications Are Hypermedia 
Applications 

The emergence of the World-Wide Web has made the hypertext paradigm more 
popular than ever. Web applications combine navigation through a heteroge- 
neous information space with operations querying or affecting that information. 
The Web is based on the hypertext paradigm, inasmuch as it is composed of 
pages (in HTML) linked together through URLs (links). Regardless of how a page
has been reached, users normally have the option of accessing pages linked to
the current page; by choosing a particular link, the page pointed to by the link
will be displayed; this process can repeat indefinitely. This succession of steps is
known as “navigation”, and is intrinsic to hypertext, and hence to the Web.

However, this second generation of hypermedia applications is rather different 
from the first one, in which applications, usually delivered on CD-ROMs, were not
supposed to be updated and, in general, were not critical for any organization. 
Web applications, on the other hand, are constantly modified, are permanently 
enriched with new services, and new navigation and interface features are added, 
e.g., according to the organization’s marketing policy. 

In this paper we argue that good Web applications should be, first of all, good
hypermedia applications, i.e. they should provide easy navigational access to
large information resources, preventing users from getting lost in cyberspace,
and providing consistent navigation operations even when other kinds of
transactional behavior are involved. As navigation problems have been widely discussed
in the hypertext literature (see for example [6]), we should be able to reuse existing
knowledge when building good Web applications.

Unfortunately, state-of-the-art conceptual modeling approaches neglect
navigation modeling, as they do not provide useful abstractions capable of easing the
task of specifying applications that embody the hypertext metaphor. In particular,
they do not provide any notion of linking, and very little is said about how to
incorporate hypertext into the interface. For example, we could easily model the
domain of an electronic commerce application using UML [14]. However, we
cannot specify critical aspects of this kind of application, such as which nodes will
be navigated or which paths or indexes the application will contain. Even if we
specify all this hypermedia functionality using UML primitives, we will be using
low-level primitives whose semantics were not intended to model navigation.

At the same time, we could model this kind of application by considering
navigation as just another kind of interface behavior; this is the approach
followed by some recent (object-oriented) tools like VisualWave [15]. In this case,
applications built using the well-known model-view-controller interface metaphor
are published on the Web by just translating views into HTML pages; only some
aspects related to concurrent access to shared databases are taken into
account. However, this approach fails to consider the most powerful feature of the
Web: its linking capabilities.

If we want to profit from the potential of the Web platform we need to
consider both aspects of Web applications: navigation and transactional (or other
kinds of conventional) behavior.

Web applications provide a powerful mechanism for building different views 
(in fact navigational views) to corporate databases. For example, while customers 
access the Amazon.com bookstore using a particular Web interface, managers or 
technical staff can access the same information resources through a different Web 
application (and obviously different access rights) in an Intranet. However, these 
views are more than simple database views as they involve different navigation 
paths, indexes, etc. In this paper we show how to design Web applications as 
views of (shared) conceptual models. In addition, it will be argued that the links 
provided for navigation are more than a representation of conceptual relations, 
as a more naive approach would suggest. 

To summarize the discussion above, we can intuit that there are distinguish- 
ing features in Web applications that present new design requirements vis-a-vis 
traditional systems. In a broad sense, we can categorize them in three groups. 
The first group of design issues has to do with navigation, addressing questions 
such as: What constitutes an “information unit” with respect to navigation? How 
does one establish what are the meaningful links between information units? 
Where does the user start navigation? How does one organize the navigation 
space, i.e. establish the possible sequences of information units the user may 
navigate through? If we are adding a WWW interface to an existing system,
how do we map the existing data objects onto “information units”, and what
relationships in the problem domain should be mapped onto links? 

The second group of design issues has to do with organization of the interface, 
addressing questions such as: What interface objects will the user perceive? How
do these objects relate to navigation objects? How will the interface behave, as 
it is exercised by the user? How will navigation operations be distinguished from 
interface operations and from “data processing” (i.e., application operations)? 
How will users be able to perceive location in the navigation space? 

The third group of design issues has to do with implementation, addressing 
questions such as: How are information units mapped onto pages? How are nav- 
igation operations implemented? How are other interface objects implemented? 
How are existing databases integrated into the application? 

In this paper we will concentrate on discussing our approach for solving the
first group of issues. The rest of this paper is structured as follows: we first introduce the
Object-Oriented Hypermedia Design Method. We next discuss how we build
navigational models as views on conceptual models; then we introduce
navigational contexts as a structuring mechanism for navigation. Finally, we discuss
some ongoing work on mining navigation patterns and present some further work
on designing and implementing this kind of system.



2 The OOHDM Design Framework 

The Object-Oriented Hypermedia Design Method [11] is a model-based approach 
for building large hypermedia applications. It has been extensively used to design 
different kinds of applications such as web sites and information systems,
interactive kiosks, and multimedia presentations. It should be stressed that OOHDM has
been applied outside the academic environment, such as in government agencies,
telecommunications companies, oil companies, IT service companies, etc.

OOHDM comprises four different activities, namely Conceptual Design,
Navigational Design, Abstract Interface Design and Implementation. During each
activity a set of object-oriented models describing particular design concerns are 
built or enriched from previous iterations. 

We explicitly separate conceptual from navigation design since they address 
different concerns in Web applications. Whereas conceptual modeling and design 
must reflect objects and behaviors in the application domain, navigation design is 
aimed at organizing the hyperspace taking into account users’ profiles and tasks. 
Though application views are not new in the literature [2], the hypermedia
paradigm as it appears in the Web raises additional concerns such as orientation, 
cognitive overhead, etc. that should be treated in a separate design activity. 

Considering conceptual, navigational and interface design as separate activi- 
ties allows us not only to concentrate on different concerns at a time, but mainly 
to obtain a framework for reasoning about the design process, encapsulating 
design experience specific to each activity. As we explain below, navigational 
design is a key activity in the design and implementation of Web applications 
and it must be explicitly separated from conceptual modeling.

OOHDM design primitives can be mapped onto non-object-oriented
implementation settings using some simple heuristics [12]. We next discuss the first
three activities in more detail. 



2.1 Conceptual Modeling 

During this step we build a model of the application domain, using well known 
object-oriented modeling principles and primitives similar to those in UML [14]. 
The product of this step is a class schema built out of Sub-Systems, Classes and 
Relationships. 

We chose UML because it is a modeling standard whose syntax and
semantics are clear and well understood. The major differences from UML are the use
of multiple-valued attributes and the explicit use of directions in
relationships. Aggregation and generalization/specialization hierarchies are used as
abstraction mechanisms.

Conceptual Modeling is aimed at capturing the domain semantics as “neu- 
trally” as possible, with little or no concern for the types of users and tasks. 
When the application involves some sophisticated behavior in conceptual ob- 
jects, it may evolve into an object model in the implementation environment. 
However it can be implemented in a straightforward way in current Web plat- 
forms combining for example a relational database with some stored procedures. 
The main thesis in this paper is that the conceptual model need not reflect the
fact that the application will be implemented in the WWW environment, since 
the actual application model will be built during navigational design. This view 
allows using the same strategy for implementing “legacy” applications in the 
Web, by considering their conceptual model as the product of this OOHDM 
activity. 

Classes in the conceptual model will be mapped to nodes in the navigational 
model using a view mechanism, and relationships will be used to define links 
among nodes, also using views. It will also be shown that there are other links in 
the navigation model that do not correspond to relationships in the conceptual 
model. 

Using a behavioral object-oriented model for describing different aspects of 
Web applications allows the expression of a rich variety of computing activi- 
ties, such as dynamic queries to an object-base, on-line object modifications, 
heuristics-based searches, etc. The kind of behavior required in the conceptual 
model depends upon the desired features of the application. For many Web ap- 
plications, in particular those implementing plain browsing (i.e. read-only) func- 
tionality, class behavior beyond linking functionality is unnecessary and does not 
need to be specified. 

Figure 1 shows the Conceptual Schema for an Academic Department Web
site. Perspectives (multiple-valued attributes) are denoted by enumerating the
possible types, with a + next to a default type. Thus, description: [text+, image]
means that attribute description has a text perspective (always present), and
may also have an image perspective.

Fig. 1. Conceptual schema of the academic department site

2.2 Navigational Design 

In OOHDM, an application is seen as a navigational view over the conceptual 
model. This reflects a major innovation of OOHDM with respect to other
methods, as it recognizes that the objects (items) the user navigates are not the
conceptual objects, but other kinds of objects that are “built” (through a view 
mechanism) from one or more conceptual objects. Moreover, it is important to 
stress that the user navigates through links, many of which cannot be directly 
derived from conceptual relationships. 

For each user profile we can define a different navigational structure that 
will reflect objects and relationships in the conceptual schema according to the 
tasks this kind of user must perform. The navigational class structure of a Web 
application is defined by a schema containing navigational classes. In OOHDM
there is a set of predefined types of navigational classes: nodes, links, anchors
and access structures. The semantics of nodes, links and anchors are the usual ones in
hypermedia applications. Access structures, such as indexes, represent possible
ways for starting navigation. Different applications (in the same domain) may 
contain different linking topologies according to the user’s profile. For example, 
in the Academic Web site we may have a view to be used by students and 
researchers, and another view for use by administrators. In the second view, a 
professor’s node may contain salary information, which would not be visible in 
the student’s view. In section 3 we detail how we specify nodes and links using 
a view definition language. 
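
As a rough illustration of profile-specific views, the sketch below builds a professor node from the same conceptual object for two user profiles. The attribute names, the sample values and the `VIEWS` table are hypothetical and are not part of OOHDM's notation.

```python
# A conceptual object, shown here as a plain dictionary (sample values only).
PROFESSOR = {"name": "Smith", "rank": "associate", "salary": 5000}

# Hypothetical table: which conceptual attributes each profile's node exposes.
VIEWS = {
    "student": ("name", "rank"),
    "administrator": ("name", "rank", "salary"),
}


def professor_node(conceptual_object, profile):
    """Build the node a given user profile navigates to."""
    return {attr: conceptual_object[attr] for attr in VIEWS[profile]}


student_node = professor_node(PROFESSOR, "student")
admin_node = professor_node(PROFESSOR, "administrator")
```

Both nodes are derived from one conceptual object; only the administrator's node carries the salary.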

The most outstanding difference between our approach and others using
object viewing mechanisms is that while the latter consider Web pages mainly as
user interfaces that are built by “observing” conceptual objects, we clearly
favor an explicit representation of navigation objects (nodes and links) during
design.

2.3 Abstract Interface Design 

In the Abstract Interface Design activity, we specify which interface objects the 
user will perceive and how the interface will behave. For each node attribute 
(either contents or anchors) we must define its appearance. By distinguishing 
between navigation and interface design we can build different interfaces for the 
same application, and in addition achieve implementation independence. 

During this activity we define what different navigational objects will look 
like, which interface objects will activate navigation, the way in which multime- 
dia interface objects will be synchronized and which interface transformations 
will take place as the user navigates. In OOHDM, we use the Abstract Data 
View (ADV) design approach for describing the user interface of a hypermedia 
application [1]. Though not completely related to the aim of this paper, it is
important to stress that building a formal model of the interface of Web ap- 
plications is a rewarding activity as user interfaces tend to change even faster 
than navigation topologies. We clearly need a precise design specification to be 
able to support changes smoothly. A complete description of our approach for 
specifying user interfaces can be found in [8]. 

2.4 Implementation 

During the implementation activity we map conceptual, navigation and interface 
objects onto the particular runtime environment being targeted. When the tar- 
get implementation environment is not fully object-oriented, we have to map the 
conceptual, navigational and abstract interface objects into concrete artifacts, 
i.e. those available in the chosen implementation environment. This may involve 
defining HTML pages (or, for example, Toolbook objects in non Web-based 
environments), scripts in some language, queries to a relational database, etc. 
Notice that even in object-oriented environments like VisualWave [15] there may
be no significant difference between conceptual and navigation objects, which will
act as models of Smalltalk’s interfaces. Meanwhile, in a more “hybrid”
environment, conceptual objects will be mapped to a persistent store (files or relational
databases) while the interface and navigation objects will be implemented as
conventional Web pages.

In the following sections we discuss the OOHDM approach for defining the 
navigational structure of Web applications. 



3 Specifying Navigational Objects as Views 

One of the cornerstones of the OOHDM approach is the fact that most navi- 
gational objects (nodes and links) are explicitly defined as views on conceptual 
objects, according to each different user profile. These views are built using an 
object-oriented definition language that allows one to “copy and paste” and/or filter
attributes of different (related) conceptual classes into the same Node class, and
to create Link classes by selecting the appropriate relationships.

In the academic site example we may want nodes representing Papers
to contain an attribute with the name of the Professor that teaches that course, and
eventually to use that name as an anchor to the Professor’s home page. It is clear
that in the conceptual model the name of the professor is an attribute of Class 
Professor and should not be included in Class Paper. As another example, in 
a different view, we may want to filter some attributes (such as the professor’s 
salary, for example) or include new relationships as links. 

Node classes are defined using a query language similar to the one in [5]. 
Nodes possess single typed attributes, link anchors, and may be atomic or com- 
posite. Anchors are instances of Class Anchor (or one of its sub-classes) and 
are parameterized with the type of Link they host. The object-oriented nature 
of nodes and anchors allows redefining their opening and activation semantics,
allowing their customization to different application domains. 

The syntax for defining Node classes is given in Figure 2, where name is the
name of the class of nodes we are creating; className is the name of a
Conceptual Class (from which the node is being mapped), called the Subject
class; nodeClass is the name of the super-class; attr_i are the names of attributes
for that class and type_i the attributes’ types; name_i are the subjects for the query
expression and varName_i are mute variables used to express logical conditions;
logical expression allows defining classes whose instances are a combination of
objects defined in the conceptual schema when certain conditions on their
attributes and/or relationships hold; anch_i are names of anchor variables (Anchor
is the abstract class for all anchors); and linkType_i are link types qualifying
anchors.

Nodes implement a variant of the Observer design pattern [3] as they ex- 
press a particular view on application objects. Changes in conceptual objects 
are broadcast to existing observers, while nodes may communicate with concep- 
tual objects to forward them events generated in the interface. 
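
The two directions of communication described above can be sketched as a small Observer arrangement. The class and method names are assumptions made for illustration; OOHDM does not prescribe them.

```python
class ConceptualObject:
    """A conceptual object that broadcasts attribute changes to its observers."""

    def __init__(self, **attributes):
        self._attributes = dict(attributes)
        self._observers = []

    def attach(self, node):
        self._observers.append(node)

    def get(self, name):
        return self._attributes[name]

    def set(self, name, value):
        self._attributes[name] = value
        for node in self._observers:
            node.update(self, name)  # broadcast the change


class Node:
    """A navigational node observing one conceptual object (its subject)."""

    def __init__(self, subject, visible):
        self.subject = subject
        self.visible = set(visible)
        self.content = {a: subject.get(a) for a in self.visible}
        subject.attach(self)

    def update(self, subject, name):
        # Refresh only the attributes this view exposes.
        if name in self.visible:
            self.content[name] = subject.get(name)

    def on_interface_event(self, name, value):
        # Forward an event generated in the interface to the conceptual object.
        self.subject.set(name, value)


course = ConceptualObject(title="DB", room="101")
node = Node(course, ["title"])
course.set("title", "Databases")
```

After the last call the node's content reflects the new title, while changes to room are ignored because the view does not expose that attribute.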

NODE name [FROM className: varName] [INHERITS FROM nodeClass]
    attr1: type1 [SELECT name1] [FROM class1: varName1, classj: varNamej
                                 WHERE logical expression]
    attr2: type2 [SELECT name2] ...
    ...
    attrn: typen [...]
    anch1: Anchor [linkType1]
    anch2: Anchor [linkType2]
END

Fig. 2. Syntax for defining node classes

For example, we would define the Node class CourseOffering as shown below,
including as an attribute the name of the teacher and an anchor for the link
connecting both nodes. We say that the conceptual class CourseOffering is the
subject of Node class CourseOffering. Note that in OOHDM we defer the decision
of defining the anchor’s appearance until the abstract interface design activity.

NODE CourseOffering [FROM CourseOffering: C]
    professor: String [SELECT Name] [FROM Professor: P WHERE P teaches C]
    ... (other attributes “preserved” from the conceptual class CourseOffering)
    taughtBy: Anchor [TaughtBy]
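
One possible reading of this node definition in executable form is sketched below. The Python classes stand in for the conceptual and navigational classes, the `teaches` list stands in for the conceptual relationship, and all names are illustrative assumptions rather than OOHDM syntax.

```python
from dataclasses import dataclass, field


@dataclass
class CourseOffering:      # conceptual class (the subject of the node)
    title: str


@dataclass
class Professor:           # conceptual class
    name: str
    teaches: list = field(default_factory=list)


@dataclass
class CourseOfferingNode:  # navigational class
    title: str             # attribute preserved from the subject
    professor: str         # SELECT Name FROM Professor: P WHERE P teaches C
    taught_by: str         # anchor, qualified by the link type it hosts


def course_offering_node(course, professors):
    """Build the node for `course`, pulling in the teacher's name."""
    # Raises StopIteration if nobody teaches the course.
    teacher = next(p for p in professors if course in p.teaches)
    return CourseOfferingNode(course.title, teacher.name, "TaughtBy")


databases = CourseOffering("Databases")
smith = Professor("Smith", [databases])
node_view = course_offering_node(databases, [smith])
```

The node carries the professor's name even though that attribute belongs to a different conceptual class, which is the "copy and paste" effect of the view mechanism.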

Links connect navigational objects and may be one-to-one or one-to-many. 
The result of traversing a link is expressed by either defining the navigational 
semantics procedurally as a result of the link’s behavior, or by using an object- 
oriented state transition machine similar to Statecharts. Since Web applications 
usually implement simple navigation semantics (closing the source node and 
opening the target), we do not discuss this issue further. 

Access structures (such as indices or guided tours) are also defined as classes 
and present alternative ways for navigation in the application. Application Links 
are also defined as views on conceptual relationships (see the discussion on Con- 
text Links in Section 4). Access structures are usually defined in Navigational 
Contexts (see Section 4), and they are specified by defining the target naviga- 
tional objects and the selectors (usually attributes of the targets). The syntax 
for defining Link classes is shown in Figure 3 (we avoid describing link attributes 
and behavior for the sake of simplicity). 

Using the syntax of Figure 3 we may define the Link class TaughtBy as 
shown below (the qualifier “S.” indicates the subject of the corresponding node 
classes, in this case CourseOffering and Professor). Notice that the conceptual 
relationship “taught by” between CourseOffering and Professor may not exist in 
the conceptual schema and we should carefully plan the final implementation of 
this view. 

LINK TaughtBy
    SOURCE: CourseOffering: c
    TARGET: Professor: p
    WHERE S.p teaches S.c
END

LINK name
    SOURCE: sourceNode: sourceVar
    TARGET: targetNode: targetVar
    WHERE logical expression
END

where name indicates the name of the Link class; sourceNode is the name of the
source Node class; targetNode is the name of the target Node class; sourceVar and
targetVar are mute variables used in the logical expression; logical expression
indicates a condition that involves the Subjects of Source, Target and perhaps other
conceptual classes.

Fig. 3. Syntax for defining link classes
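
A link class defined this way can be read as a filter over pairs of nodes whose subjects satisfy the WHERE condition. The sketch below assumes nodes are dictionaries with a "subject" entry and that the conceptual relationship is available as a set of pairs; both are illustrative simplifications.

```python
def derive_links(source_nodes, target_nodes, condition):
    """Return (source, target) node pairs whose subjects satisfy `condition`."""
    return [
        (s, t)
        for s in source_nodes
        for t in target_nodes
        if condition(s["subject"], t["subject"])
    ]


# Hypothetical conceptual relationship: (professor, course) pairs.
teaches = {("Smith", "Databases")}

course_nodes = [{"subject": "Databases"}, {"subject": "Logic"}]
professor_nodes = [{"subject": "Smith"}]

# LINK TaughtBy: SOURCE CourseOffering: c, TARGET Professor: p, WHERE S.p teaches S.c
taught_by = derive_links(
    course_nodes, professor_nodes, lambda c, p: (p, c) in teaches
)
```

Only the Databases node obtains a TaughtBy link here, since no professor teaches Logic in the sample data.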

The navigational schema contains a diagrammatic description of the relation- 
ships among nodes. Each navigational schema represents the model of a different 
Web application. 

It is important to stress the similarities and differences among the conceptual 
and navigational schema. They are similar because both are abstract and imple- 
mentation independent and they represent concepts of the underlying application 
domain using objects. However, while the former should be neutral with respect 
to navigation, the latter expresses a particular user’s view (in the navigation 
sense) that is strongly influenced by the tasks he is supposed to perform. 

OOHDM enforces a clear separation between the specification of navigation 
and other application behavior. However, in complex Web applications it may 
be necessary to integrate both kinds of behaviors (electronic stores are a good 
example of this need). Since nodes implement Observers they can communicate 
easily with their conceptual counterpart in order to delegate actions they cannot 
perform (as for example, modifying a persistent store). 

Nodes and Links are the basic primitives of Web applications. However,
we need higher level abstractions to build meaningful and usable navigation 
structures, since we may need to introduce nodes and links that do not reflect 
conceptual entities or relationships and that may be defined opportunistically 
for improving navigation. We next introduce Navigational Contexts, a powerful 
mechanism for structuring the navigational space. 



4 Structuring the Navigational Space with Contexts 

Web applications usually contain sets of pages dealing with similar concepts, 
e.g.: books from an author, CDs performed by a group, hotels in a city, etc. 
These sets may be explored in different ways, according to the task the user 
is performing. For example, in an electronic bookstore he may want to explore 
books by an author, books from a certain period of time or literary movement, etc. It
is also desirable to give him different kinds of feedback in different contexts, while
allowing him to move easily from item to item. For example, it is not reasonable
that if he wants to explore the set of all books written by Shakespeare, he has
to backtrack to the index (the result of a keyword search, for example) to reach
the next book in the set.

As a result of organizing navigation objects into sets, many navigation
operations refer to intra-set navigation, most notably “next”, “previous” and “up”.
Therefore, sets define links that allow such navigations, and these links have no 
direct counterparts in the conceptual model. In other words, there is no concep- 
tual relationship that directly translates into intra-set navigation links. 

Unfortunately, most modeling approaches ignore sets as first-class citizens 
and therefore operations such as “next” and “previous” are not usual while 
traversing sets. To make matters worse, the same node may appear in different 
sets: e.g. a book written by Shakespeare may appear in the set of Romantic books 
or in the set of books written in England. We may even want to include some 
comments about the book in the corresponding context, e.g.: when accessed as 
a romantic book, some comments about the role of the book in the romantic 
period. 

OOHDM structures the navigational space into sets, called Navigational Con- 
texts, represented in a Context Schema. Each Navigational Context is a set of 
nodes and it is described by indicating its internal navigational structure (e.g. if 
it can be accessed sequentially), an entry point and associated indexes.
Generally speaking, contexts are defined by properties of their elements, which may be
based on their attributes or on their relations, or both. There are four special 
cases that occur more frequently when stating such properties to define contexts 
in OOHDM: 

1. Simple class derived — includes all objects of a class that satisfy some prop- 
erty ranging over their attributes; e.g. “professors with rank = associate” . 
Graphically: 

2. Class derived group — is a set of simple class derived contexts, where the 
defining property of each context is parameterized; e.g. “professors by rank” , 
“paintings by painter” (rank and painter can vary). Graphically, same as 1. 

3. Simple link derived — includes all objects related to a given object; e.g., 
“courses taught by Professor Smith”, “exhibitions where Sun Flowers was 
presented”. Graphically, same as 1. 

4. Link derived group — a set of link derived contexts, each of which is obtained 
by varying the source element of the link; e.g. “courses taught, by professor” , 
“exhibitions by painting” (professor and painting can vary). Graphically, 
same as 1. 

Besides the context definition forms above, there are the following additional 
forms: 

5. Arbitrary — The set is defined by enumeration. For example, a guided tour 
showing some pictures in a museum or some outstanding research projects. 
Graphically, same as 1. 
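
The context definition forms above can be read as different ways of computing a set of nodes. The following Python sketch is only an illustration under our own assumptions: the `Professor` and `Course` classes, their attributes, and the helper function names are hypothetical and are not part of OOHDM.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Professor:          # hypothetical node class
    name: str
    rank: str


@dataclass(frozen=True)
class Course:             # hypothetical node class
    title: str
    taught_by: Professor


def simple_class_derived(objects, predicate):
    """Form 1: all objects of a class that satisfy a property."""
    return [o for o in objects if predicate(o)]


def class_derived_group(objects, key):
    """Form 2: one context per value of the parameter (e.g. rank)."""
    group = defaultdict(list)
    for o in objects:
        group[key(o)].append(o)
    return dict(group)


def simple_link_derived(targets, link, source):
    """Form 3: all objects related to one given source object."""
    return [t for t in targets if link(t) == source]


def link_derived_group(targets, link):
    """Form 4: one link derived context per source object."""
    return class_derived_group(targets, link)


smith = Professor("Smith", "associate")
jones = Professor("Jones", "full")
professors = [smith, jones]
courses = [Course("Databases", smith), Course("Hypermedia", smith),
           Course("Logic", jones)]

associates = simple_class_derived(professors, lambda p: p.rank == "associate")
by_rank = class_derived_group(professors, lambda p: p.rank)
smith_courses = simple_link_derived(courses, lambda c: c.taught_by, smith)
by_professor = link_derived_group(courses, lambda c: c.taught_by)
guided_tour = [courses[2], courses[0]]  # Form 5: arbitrary, by enumeration
```

In this reading, a group is simply a mapping from parameter values to simple contexts, which is why forms 2 and 4 can share one helper.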

Contexts may also vary during navigation, either because the reader can
create or modify information elements (navigation objects), thus affecting the
elements of a derived context, or because they can explicitly insert or remove
objects in the context. Such contexts are said to be dynamic; examples are
history and shopping baskets. Graphically:

In any of the above, if there is an access structure defined for it, the
corresponding graphical notation contains a small black square in the upper left
corner. Associated with contexts are access structures (indices). They are
denoted graphically by distinct symbols for a Simple Index, a Dynamic Index,
and an Index with multiple orderings (symbols not reproduced here).

The Navigational Context Schema represents contexts and their access struc- 
tures. In Figure 4 we show the context schema for the academic site. Notice that 
for each Node class (i.e. Student, Professor, Research Result, Research Project, 
Laboratory, etc) we have indicated different kinds of contexts and indexes, such 
as the Main Menu, the Personnel Category Menu, etc. Arrows indicate both 
navigational relationships and possible transitions among navigational contexts. 

When the same node (e.g. Research Project, Professor, etc.) may appear in 
more than one set (context) we need to express the peculiarities of this node 
within each particular context. We may take as a default that “next” and “pre- 
vious” anchors and links are automatically defined for traversing each set; but we 
may also want that some context-sensitive information appears when accessing 
a Professor by research area (for example giving access to the papers he wrote 
in that area). 

In OOHDM this is achieved with InContext classes; for each Node class and 
each Context in which it appears, we can define an InContext class that acts as 
a Decorator [3] for nodes when accessed in that particular context. Decorators 
provide a good alternative to sub-classing, and prevent us from defining multiple 
sub-classes of the base Node class. InContext Classes are organized in hierarchies 
with some base classes already provided by the design framework; for example 
InContext classes defined as sub-classes of InContextSequential inherit anchors 
for sequential navigation and for backtracking to the context index. When we 
do not define InContext classes, a default one is assumed according to the type 
of context defined. It is important to stress that (similarly to context links) 
InContext classes are not directly mapped from the conceptual schema since 
they address a “pure” navigational concern. 
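
The role of InContext classes as Decorators can be illustrated with a small Python sketch. It is not OOHDM's own notation or framework code: the class names `ProfessorNode`, `Context`, and `ProfessorByArea`, and the method names, are assumptions made for this example; only the name InContextSequential is taken from the text.

```python
class ProfessorNode:
    """A navigational node (hypothetical example class)."""

    def __init__(self, name, papers):
        self.name = name
        self.papers = papers  # list of (title, research area) pairs


class Context:
    """An ordered set of nodes that wraps each node in an InContext class."""

    def __init__(self, name, nodes, in_context_class):
        self.name = name
        self.nodes = list(nodes)
        self.in_context_class = in_context_class

    def at(self, position):
        if 0 <= position < len(self.nodes):
            return self.in_context_class(self.nodes[position], self, position)
        return None


class InContextSequential:
    """Decorator adding 'next' and 'previous' anchors within one context."""

    def __init__(self, node, context, position):
        self._node = node
        self._context = context
        self._position = position

    def __getattr__(self, attribute):
        # Anything not defined by the decorator is answered by the node.
        return getattr(self._node, attribute)

    def next(self):
        return self._context.at(self._position + 1)

    def previous(self):
        return self._context.at(self._position - 1)


class ProfessorByArea(InContextSequential):
    """Context-sensitive information: papers written in the context's area."""

    def papers_in_area(self):
        return [title for title, area in self._node.papers
                if area == self._context.name]


smith = ProfessorNode("Smith", [("Views", "Databases"), ("Anchors", "Hypermedia")])
jones = ProfessorNode("Jones", [("Joins", "Databases")])
by_area = Context("Databases", [smith, jones], ProfessorByArea)
first = by_area.at(0)
```

The base node class stays unchanged; the same `ProfessorNode` could be placed in another context with a different InContext class, which is the point of using a decorator instead of sub-classing.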

The Navigational Contexts Schema complements the Navigational Schema 
by showing the way in which nodes are grouped into navigable sets. Additional 
node behavior can be implemented in InContext classes. In Amazon.com, for 
example, when we access a book in the context of a query we have an option to 
move it to the shopping basket. When we access the same book in the context 
of the shopping basket we should have other operations to perform. In
the example of the academic Web site, presenting a “Research Project” for a
sponsor will show different attributes than those presented when showing the
same project to another researcher.

Fig. 4. Navigational contexts in an academic Web site

Navigational Objects (nodes, links, contexts, indexes, etc) are documented 
using a set of cards (like CRC cards [16]) that provide complete information for 
implementers. 

Though OOHDM does not pre-suppose a particular implementation strategy 
for mapping contexts to a run-time setting, there are many alternatives whose 
main differences are the amount of “intelligence” in client pages and/or the Web 
server (see [12] for a discussion). Taking into account the increasing trend towards 
“object-orienting” the Web [4], implementing complex navigational structures in
Web applications directly with object technology is already feasible, for example 
using Java (see, for instance, [7]). 



5 Concluding Remarks and Further Work 

In this paper, we have argued that we need several different design models for 
building Web applications. We have presented the OOHDM approach, which
comprises four different activities, namely: conceptual modeling, navigational design,
abstract interface design and implementation. We have focused on the design of
the navigational structure of Web applications, and have shown that these appli- 
cations are built as views on conceptual models. We have described navigational 
contexts as a structuring mechanism for improving navigation. In OOHDM the 
navigational schema describes which classes will be navigated and how, whereas 
the navigational contexts schema provides additional information when dealing 
with collections of nodes. 

As in most complex design domains, just a method (and a set of modeling 
primitives) is not enough for coping with the inherent complexity of Web ap- 
plications. We need to understand which recurrent problems we solve and be 
able to reuse good design solutions to those problems. Design Patterns [3] are a 
good strategy for recording and communicating design expertise about recurrent 
problems. 

During the last four years we have been mining design patterns in the hyper- 
media field: we call them hypermedia patterns [9, 10]. We have identified many 
recurrent problems and well known design solutions and have recorded them in 
the form of patterns. Using these patterns we can improve our ability to build 
Web applications; we also simplify the navigational schema as we can use pat- 
terns as higher level constructs, thus reducing the number of connections we 
have among navigational classes. 

As an example, the Landmark navigational pattern indicates that when some
subsystem should be easily reached from every node in the application, we
should treat it in a special way and make it perceivable with a standard
interface. Examples of Landmarks are the Book, Music, Gift
and Auction subsystems at Amazon.com, and the global navigation bars in most
Web applications.

When we identify a Node as being a Landmark it is understood that we would 
be able to navigate to that node from every page, so we do not need to define 
all links pointing to the Landmark. In such cases not only do we obtain a well- 
behaved application, but we also simplify the navigational schema. Landmarks 
show another kind of links that are not derived from conceptual relationships, 
but are defined opportunistically to reduce navigation effort and complexity. We 
are now working on a project associated with ACM-SigWeb (the ACM special
interest group on Hypermedia and the WWW) for building an online repository 
of hypermedia and Web patterns, together with examples, known-uses, imple- 
mentations of those patterns, etc. 

We are also enriching OOHDM by introducing the concept of Web applica- 
tions frameworks (i.e. set of abstract design structures that can be instantiated 
for different applications in the same domain); we are defining a notation for de- 
scribing and instantiating frameworks [13]. In this way we can build even more 
abstract conceptual and navigational schemas that may comprise families of re- 
lated Web applications. We believe that our approach may help to obtain greater 
levels of reuse in the design of Web applications, and therefore this will reduce 
development time and costs, by simplifying evolution and maintenance. 

Abstract. In the design of information services for the web, developing
a navigation structure is one of the most crucial and time consuming
tasks. An inappropriate navigation structure usually drives users away, and user
acceptance can be considered an indicator of service quality. Adapting the navigation
structure after the implementation of the system is often too expensive 
(in terms of time and effort). 

To support the designer in this task, we propose a prototyping approach. 
The designer is enabled to incompletely specify user scenarios which serve 
to derive early prototypes, but also to alter and refine them during the 
design process. As a basis, we use the entity/relationship model. It carries 
information about data types and semantical knowledge of the applica- 
tion domain as associations between types. Through a specification pro- 
cess, user scenarios are derived from the schema by creating navigation 
structures. 

Therefore, we introduce a basis to specify user scenarios, a rule system 
used for defining derivation rules, and propose several heuristical rules
which exploit semantics from the E/R schema to refine scenario specifi- 
cations. 



1 Introduction 

The increasing demand for information services for the web encourages the de- 
velopment of methodologies and tools to efficiently design and maintain these 
systems. While commercially developed and applied tools as, for example, from 
Oracle, Sybase, Informix, or Allaire, mainly offer the technical basis for the im- 
plementation, logical models at an abstract specification level are missing. This 
circumstance causes known design and maintenance problems. 

To overcome this situation, several proposals as, for example, the HDM 
model [12], the ARANEUS web design methodology [2], a view based approach 
[10], the WebComposition model [13], or the OOHDM method [18] introduced 
logical modeling together with a (semi-)automatic generation of information
services. In our paper, we take up this idea. As a logical model, we use the notion of user
scenarios. In contrast to other approaches [11], besides design and maintenance,
we particularly intend to support rapid prototyping. Therefore, we want to
support the designer in developing logical models starting from the E/R schema. In
particular, we investigate how far semantics modeled in the E/R schema can be 
exploited to derive navigation structures for user scenarios. 

To delimit the scope of the paper, we will focus on information services which 
mainly concern providing information to users. As commonly understood, the 
following components are essential for information services in the web:
navigation, information, and presentation. These components affect each other in
design and maintenance [8,6]. While style languages such as XSL [20] seem
to be a promising approach to enable efficient maintenance of the presentation, 
designing and maintaining navigation and information is still a hard task since 
changes at one component usually cause changes at the other. 

Besides the importance of maintenance, the design of information services for 
the web possesses specific traits. For at least two purposes, a rapid prototyping 
design process can improve service quality [9, 14]: 

- User orientation: An expensive service is worthless without any users.
Therefore, users of the intended user groups should test the service before
its publication.

- Aspect of promotion: Besides providing services, web sites often support
promotion, for example, for a region, a company, a product, or an institution.
Therefore, the service owner needs to assess features such as its corporate
identity.

To enable rapid prototyping in the design process, we apply a standard strat- 
egy. The designer states a minimal scenario specification. To complete the sce- 
nario specification, the system applies predefined heuristical rules — either de- 
fault rules, or rules that include semantics from the given E/R schema. Through 
this approach, prototypes can be derived early which may be adapted afterwards 
by directly adapting the scenario specification or constraining the rules. 

The remainder of the paper is organized as follows. In section 2, we specify
the notion of user scenarios adapted to the application domain of information 
services. In section 3, we first specify the notion of derivation rules and second 
introduce heuristics to derive user scenarios from a given scenario outline and the 
E/R schema. In section 4, we discuss issues of scenario integration, refinement, 
and maintenance, and the generation of executable and system independent 
prototypes. 

2 Specification of User Scenarios

To illustrate the design process, throughout the paper, we apply a part of a local 
city information system [19,5]. There, hotels, events, and local sights should be 
available for citizens, visitors, and economy people. We will use a simplified E/R 
schema illustrated in Figure 1. 

Fig. 1. Simplified E/R schema covering a part of a city information system

For better readability, we omitted attributes and names of relationship
types that possess an intuitive meaning. Concerning the cardinality constraints,
we apply the participation semantics. In addition, we partially enriched the
schema by the expected numbers of instances as well as average cardinalities.

Before we can start to explain the derivation process from the E/R schema 
to user scenarios, we need to clarify the notion of a user scenario. Generally, 
a user scenario describes a sequence of activities (also called story), including 
alternatives, a user needs to follow to fulfil a more or less closed task [4]. The 
task may consist of subtasks and alternative paths. Thus, a specific execution of 
a user scenario, i.e., a use case, corresponds to one possible application of the 
scenario. 

Since we delimit our scope to scenarios which aim at providing information
to a user, we characterize user scenarios more specifically. To this end, we divide
navigation into two distinct acts: (i) navigating between associated
information, and (ii) specifying a selection. To give an example, navigating from
general hotel information to events offered by the hotel applies to (i), while se- 
lecting a specific time or category of an event applies to (ii). Therefore, we divide 
information into information types as, for example, hotel, sight, or event, and 
classification types as, for example, time, category, or location. For our purpose, 
we apply the following definitions: 



Definition 1 A user scenario consists of a set of information nodes $N_I$ and a
partial order $O_I$ over $N_I$ with a smallest element, called root (node).



The information nodes act as subtasks in a scenario, and the partial order 
defines the navigation structure with a designated information node as the root. 

Definition 2 An information node $I$ consists of a data type $T_I$, called
information type, a classification $C$, and a function $d_C$ defined over the classification,
called default function.

A classification $C$ consists of a set of data types $N_C$, called classification
types, and a partial order $O_C$ over $N_C$.

A default function $d_C$ over a classification $C$ is a partial function defined
over the set of classification types $N_C$ that assigns to a type $T_C \in N_C$ a set of
values of type $T_C$.

The data type of an information node is used for the presentation of infor- 
mation, the classification is used to enable the user to select preferences, and 
through defaults, the scope of scenarios may be further restricted. In the defi- 
nition, the type system is assumed to correspond with the one used in the E/R 
model. 
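
Definitions 1 and 2 can be mirrored by a few plain data structures. The Python sketch below is one possible encoding under our own assumptions: the class and field names are hypothetical, types are represented by their names as strings, and the sample ordering of nodes is chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Classification:
    types: set = field(default_factory=set)   # N_C: classification types
    order: set = field(default_factory=set)   # O_C: pairs (earlier, later)


@dataclass
class InformationNode:
    info_type: str                                # T_I
    classification: Classification = field(default_factory=Classification)
    defaults: dict = field(default_factory=dict)  # d_C: type -> set of values


@dataclass
class UserScenario:
    nodes: dict   # N_I, keyed by a node name
    order: set    # O_I: pairs (earlier node, later node)
    root: str

    def is_rooted(self):
        """Check that every node can be reached from the root via the order."""
        reached, frontier = {self.root}, [self.root]
        while frontier:
            current = frontier.pop()
            for earlier, later in self.order:
                if earlier == current and later not in reached:
                    reached.add(later)
                    frontier.append(later)
        return reached == set(self.nodes)


event = InformationNode(
    "Event",
    Classification({"Time", "Category", "Location"},
                   {("Time", "Category"), ("Time", "Location")}),
)
scenario = UserScenario(
    nodes={"Hotel": InformationNode("Hotel"), "Event": event,
           "Contact": InformationNode("Contact")},
    order={("Hotel", "Event"), ("Hotel", "Contact")},  # assumed for the example
    root="Hotel",
)
```

The `is_rooted` check corresponds to the requirement that the root is the smallest element of the partial order; a partial default function is naturally a dictionary that simply lacks entries for types without defaults.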

We illustrate the definition by an example scenario: Besides hotel
information, a visitor might be interested in events offered as well as places
of interest close by. Thus, a corresponding scenario might first offer the selection
of an appropriate hotel through given preferences, and then offer either
contact information or the selection of events and sights, depending on the interest
of the visitor.




Fig. 2. Navigation structure of a visitor scenario 



Figure 2 graphically represents the following scenario specification:

Set of information nodes $N_I = \{I_{Hotel}, I_{Event}, I_{Sight}, I_{Contact}\}$

Partial order $O_I = \{(I_{Hotel}, I_{Event}), (I_{Hotel}, I_{Sight}), (I_{Hotel}, I_{Contact})\}$
with root $I_{Hotel}$

Data types $\{T_{Hotel}, T_{Event}, T_{Sight}, T_{Contact}\}$ and
$\{T_{Class}, T_{RoomType}, T_{Location}, T_{Time}, T_{Category}, T_{CategoryS}\}$










E/R Based Scenario Modeling 257 



Information nodes
$I_{Hotel} = (T_{Hotel}, C_{Hotel}, d_{Hotel})$, $I_{Event} = (T_{Event}, C_{Event}, d_{Event})$,
$I_{Sight} = (T_{Sight}, C_{Sight}, d_{Sight})$, $I_{Contact} = (T_{Contact}, C_{Contact}, d_{Contact})$

Classifications
$C_{Hotel} = (\{T_{Class}, T_{RoomType}, T_{Location}\}, \{\})$,
$C_{Event} = (\{T_{Time}, T_{Category}, T_{Location}\}, \{(T_{Time}, T_{Category}), (T_{Time}, T_{Location})\})$,
$C_{Sight} = (\{T_{CategoryS}\}, \{\})$, and
$C_{Contact} = (\{\}, \{\})$; and nowhere defined default functions.

In the figure, dashed boxes represent information nodes, dotted boxes repre- 
sent classifications, inner rounded boxes represent information types, and inner 
angular boxes represent classification types. The arrows between information 
nodes as well as classification types represent the defined partial orders. Addi- 
tionally, the scenario might be enriched by stipulating reasonable default values 
for hotel class or event category. 

The classifications indicate a relationship to multi-dimensional data modeling 
[3,17,15]. Applying this view, the information types represent measures (also 
called facts) and the classification types represent dimensions. Note that, in contrast
to dimensions, the partial order between classification types indicates an execution
order instead of a level of granularity. We remark that, similar to the idea of
multi-dimensional data modeling, where measures and dimensions should be 
commutable [1, 16], a data type may be used as a classification type as well as 
an information type. 

While the definitions above only specify the structure, and thus, the syntax 
of user scenarios, we additionally have to specify the semantics. 

An information type corresponds to the set of all instances of the underlying 
data type in the current database state, for example, all hotels acquired in the 
database. A classification type corresponds to the set of all elements in its type 
domain, for example, all hotel classes. 

An information node is used to present instances of its information type. 
Particularly, an information node is to be interpreted as a user interface that 
enables the user to restrict the instance set of the information type through 
selecting subsets of the underlying type domains of the classification types. For 
example, a user interface might offer (i) a form of three list boxes to enable the 
selection of a set of classes and room types and a location, or (ii) a navigation 
hierarchy that first offers all possible locations, then the hotel classes, and finally 
the room types. After the selection, the matching hotel instances are presented.

The partial order over the information nodes defines the execution sequence. 
Nodes directly connected to the same ancestor define alternative execution paths 
which are selected by the user. Thus, the order specifies the navigation structure 
of the scenario. 

The partial order over the classification types defines the scenario-dependent
order in which the classification types have to be offered to the user. Types with
the same ancestor may be offered in parallel.
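
The semantics described above can be made concrete with a short Python sketch. It is an illustration only: the function names, the dictionary representation of instances, and the sample hotels are assumptions, and grouping classification types into stages is one possible reading of "offered in parallel".

```python
def restrict(instances, selection):
    """Keep the instances whose value for every selected classification type
    lies in the chosen subset of that type's domain."""
    return [inst for inst in instances
            if all(inst[ctype] in allowed
                   for ctype, allowed in selection.items())]


def offering_stages(types, order):
    """Group classification types into stages that respect the partial order.
    Types within one stage may be offered to the user in parallel."""
    remaining, stages = set(types), []
    while remaining:
        # A type is ready when none of its predecessors is still pending.
        stage = {t for t in remaining
                 if not any(later == t and earlier in remaining
                            for earlier, later in order)}
        if not stage:
            raise ValueError("the given order contains a cycle")
        stages.append(sorted(stage))
        remaining -= stage
    return stages


hotels = [  # hypothetical instances of the information type Hotel
    {"name": "Adler", "Class": 3, "Location": "Centre"},
    {"name": "Bellevue", "Class": 5, "Location": "Centre"},
    {"name": "Krone", "Class": 3, "Location": "Harbour"},
]
chosen = restrict(hotels, {"Class": {3, 4}, "Location": {"Centre"}})
stages = offering_stages({"Time", "Category", "Location"},
                         {("Time", "Category"), ("Time", "Location")})
```

With the order of the Event classification, `stages` first offers Time and then Category and Location together, which matches the navigation hierarchy variant of the user interface.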

Throughout the execution of a user scenario, a context is maintained. In a 
given state, it consists of the user entries and the defaults “encountered” on the 
executed scenario path. In particular, the classification types specialize during
a scenario execution, for example, a user-selected hotel location specializes the
classification type Location in this scenario execution. 

We remark that user scenarios defined in this way only provide a frame for their final
execution. To generate executable services, additional semantics must be added.
We postpone the discussion about semantical refinements to section 4. 

3 Derivation of User Scenarios from the E/R Model 

In this section, we investigate how a designer applies an E/R schema to define the 
basic structure of scenarios, and to what extent semantical knowledge modeled 
in the E/R schema can be exploited to derive scenario specifications. To enable 
a flexible derivation system, we define derivation heuristics in terms of rules. 
Thus, on the one hand, the derivation system is expandable, and on the other, 
it is adaptable to different types of E/R models. 

To define an outline of a user scenario, the designer selects a set of information 
types from the E/R schema, either entity or relationship types, and a partial 
order over them. Optionally, the designer may define a set of classification types, 
their ordering, and defaults. The information types are regarded as a skeleton 
of the scenario, since they define the information to be presented, and thus, 
the subtasks of the scenario. We will call the predefined parts of a scenario the 
scenario outline. In the example scenario, the scenario outline might comprise 
the information types Hotel, Event, Sight, and Contact selected from the E/R 
schema in Figure 1, the partial order illustrated in Figure 2, and Location as a 
predefined classification type. 



3.1 Specification of Derivation Rules 

Usually, in early stages of the design process, there exist semantical gaps. There- 
fore, in a given scenario outline, we permit incomplete information about the 
types, the ordering, and the defaults. Our intention is to enrich the semantics 
through the application of predefined rules. Therefore, we define rules that
apply default assumptions on the one hand, and exploit semantics from the E/R
schema on the other.

Generally, the approach executes as follows. We start with an empty scenario 
specification and consider the given scenario outline as a pool. Default rules 
first transfer information types together with their ordering from the scenario 
outline into the scenario specification while creating new information nodes and 
a corresponding ordering. Afterwards, schema based rules are applied which refine
the scenario specification, if specific patterns can be found in the E/R schema. 
Concerning the specification of rules, we apply the following definition: 

Definition 3 A derivation rule consists of an input, an optional matching part, 
preconditions, and actions. 
The input consists of an E/R schema, a scenario outline, and a scenario
specification. The matching part consists of a set of (partially specified) infor- 
mation nodes, called node pattern and optionally a (partially specified) E/R 
(part) schema, called E/R pattern, which may contain variables. The precon- 
ditions consist of predicates constraining the domain of the variables, the node 
pattern, the scenario outline and the scenario specification. The actions consist 
of changes of the scenario specification and the scenario outline. 

We only illustrate the semantics of rule application briefly: First, a
matching between information nodes of the current scenario specification and a
node pattern is established. In other words, a specific node pattern is searched
for in the scenario specification. Second, the E/R pattern associated with the
node pattern in the rule definition, if any, is matched
against the complete E/R schema. Finally, the preconditions are verified, and
the corresponding actions are invoked which extend the current scenario
specification. Since the E/R pattern is optional, the definition of derivation rules
makes it possible to express both default rules and schema based rules. To clarify
the rule application, we define straightforward default rules. Schema based rules
are considered in section 3.2.

As a default, for each information type chosen in the scenario outline, an
information node with an empty classification is created. Formally, this default
rule can be expressed as follows (where the prefix User indicates parts of the
user-defined scenario outline):

Precondition: $T_I \in UserN_I$

Actions: (i) $N_I := N_I \cup \{(T_I, (\{\}, \{\}), \{\})\}$

(ii) $UserN_I := UserN_I \setminus \{T_I\}$

Similarly, the given order of information types is transferred into the scenario
specification (where the symbol _ indicates an unconstrained matching):

Node Pattern: $I_1 = (T_{I_1}, \_, \_)$, $I_2 = (T_{I_2}, \_, \_)$

Precondition: $(T_{I_1}, T_{I_2}) \in UserO_I$

Actions: (i) $O_I := O_I \cup \{(I_1, I_2)\}$

(ii) $UserO_I := UserO_I \setminus \{(T_{I_1}, T_{I_2})\}$
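
To make the two default rules tangible, the following Python sketch applies them exhaustively to a scenario outline. It is a minimal sketch under our own assumptions: the function name, the dictionary layout of a node, and the use of type names as node identifiers are hypothetical and not prescribed by the rule system.

```python
def apply_default_rules(user_types, user_order):
    """Transfer information types and their order from the outline (the pool)
    into a fresh scenario specification."""
    outline_types, outline_order = set(user_types), set(user_order)
    nodes, order = {}, set()

    # First rule: every chosen information type yields a node with an
    # empty classification ({}, {}) and no defaults; the type leaves the pool.
    while outline_types:
        info_type = outline_types.pop()
        nodes[info_type] = {"type": info_type,
                            "classification": (set(), set()),
                            "defaults": {}}

    # Second rule: an ordered pair of types yields an ordered pair of nodes,
    # but only if the node pattern matches, i.e. both nodes exist.
    for first, second in list(outline_order):
        if first in nodes and second in nodes:
            order.add((first, second))
            outline_order.remove((first, second))

    return nodes, order, outline_types, outline_order


nodes, order, rest_types, rest_order = apply_default_rules(
    {"Hotel", "Event"},
    {("Hotel", "Event"), ("Hotel", "Sight")},  # Sight is not in the outline
)
```

Pairs whose node pattern cannot be matched stay in the pool, so a later rule (or the designer) can still deal with them.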

We remark that more complex default rules may be used to restructure user 
scenarios concerning ergonomic requirements as, for example, a maximum/mini- 
mum number of alternative branches. 



3.2 Extraction of Semantical Knowledge from the Schema 

In the following, we explain derivation rules that include semantics from the 
E/R schema into the scenario specification. 

Deriving classification types 

— New classification types can be derived from 1 : n and n : m relationship 
types. They are considered to be potential classification criteria concerning 
the related information type. For example, Hotel includes RoomType and
Class into its classification. Using a graphical representation for E/R pat- 
terns, the following rule formally expresses one of the possible cases of this 
derivation: 



Node pattern: $I = (T_I, C, \_)$ with $C = (N_C, \_)$

E/R pattern: (graphical pattern not reproduced)

Preconditions: (i) $T_C \notin N_C$

(ii) $a > 1$

Action: $N_C := N_C \cup \{T_C\}$



- If the E/R schema contains a relationship between an information type and
a classification type, the classification type is included into the corresponding
information node. For example, if Hotel is specified as an information type
and Location as a classification type, Hotel includes Location into its
classification. The rule definition is similar to the one stated above. It does not
require the variable $a$, but an additional precondition $T_C \in UserN_C$ that
verifies if the classification type is defined in the scenario outline.

- If there exist many potential classification types, it is advantageous to
exclude less useful types. For this reason, the minimum cardinality of the
relationship type can be exploited. If the minimum is 0, the corresponding
classification type is not always used for classifying the information type, and thus,
may be omitted. The rule can be expressed by specializing the E/R pattern
stated above. The cardinality constraint at the left side must be $(1, \_)$.
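
The three heuristics for deriving classification types can be combined in one small function. The Python sketch below rests on several assumptions that are ours: the `Relationship` record, its field names, and the sample relationships are hypothetical, and the field `a` is simply taken to be the cardinality value that the first rule compares with 1, since the graphical E/R pattern is not available here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:       # hypothetical, simplified view of an E/R relationship
    info_type: str
    class_type: str
    min_card: int         # assumed: minimum cardinality checked by the third rule
    a: float              # assumed: the variable a of the first rule


def derive_classification_types(info_type, known, relationships,
                                outline_class_types=frozenset(),
                                require_mandatory=False):
    """Return the classification types of info_type after applying the rules."""
    derived = set(known)
    for rel in relationships:
        if rel.info_type != info_type or rel.class_type in derived:
            continue                                    # precondition (i)
        in_outline = rel.class_type in outline_class_types   # second rule
        multi = rel.a > 1                                     # precondition (ii)
        mandatory = rel.min_card >= 1                         # third rule
        if in_outline or (multi and (mandatory or not require_mandatory)):
            derived.add(rel.class_type)
    return derived


rels = [
    Relationship("Hotel", "RoomType", 1, math.inf),
    Relationship("Hotel", "Class", 1, math.inf),
    Relationship("Hotel", "Facility", 0, math.inf),   # optional participation
    Relationship("Hotel", "Location", 1, 1),
    Relationship("Event", "Time", 1, math.inf),
]
all_types = derive_classification_types("Hotel", set(), rels)
strict = derive_classification_types("Hotel", set(), rels, require_mandatory=True)
with_outline = derive_classification_types("Hotel", set(), rels,
                                           outline_class_types={"Location"})
```

The flag `require_mandatory` models the designer's choice to exclude less useful types; it is off by default because the text presents that rule as an optional refinement.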



Deriving ordering of classification types 



- If the schema contains expected numbers of instances, the order of
classification types can be derived through ergonomic rules. As a rule, classification
types with fewer instances might be prioritized, and thus, selected first. The
following rule formally expresses this derivation (for the case that $T_I$ is an entity
type):



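Node pattern: $I = (T_I, C, \_)$, $C = (N_C, O_C)$

E/R pattern: (graphical pattern not reproduced; $a$ and $b$ annotate $T_{C_1}$ and $T_{C_2}$)

Preconditions: (i) $T_{C_1}, T_{C_2} \in N_C$

(ii) $(T_{C_1}, T_{C_2}), (T_{C_2}, T_{C_1}) \notin O_C$

(iii) $a < b$

Action: $O_C := O_C \cup \{(T_{C_1}, T_{C_2})\}$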



- Similarly, ergonomic rules can be defined over the expected cardinality
numbers. As a rule, a large cardinality number implies a good selectivity of the
classification type, and thus, it should be prioritized.
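
The ordering rule based on expected numbers of instances can be sketched as follows. This Python fragment is an illustration under our own assumptions: the function name and the sample counts are hypothetical, and the rule is applied locally to pairs, without checking the resulting order for transitivity.

```python
def derive_order(class_types, order, expected_instances):
    """Add (t1, t2) whenever the two types are not yet ordered in either
    direction and t1 has fewer expected instances than t2."""
    result = set(order)
    for t1 in sorted(class_types):
        for t2 in sorted(class_types):
            if t1 == t2:
                continue
            unordered = (t1, t2) not in result and (t2, t1) not in result
            if unordered and expected_instances[t1] < expected_instances[t2]:
                result.add((t1, t2))
    return result


counts = {"Class": 5, "RoomType": 10, "Location": 40}   # assumed numbers
derived = derive_order({"Class", "RoomType", "Location"}, set(), counts)
kept = derive_order({"Class", "Location"}, {("Location", "Class")}, counts)
```

An order already given by the designer takes precedence: in the second call the pair (Location, Class) is kept although Class has fewer expected instances.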












Deriving information types and ordering 

— Indirect paths between two ordered information types imply the introduction 
of additional information types as connectors. This applies, if there are sev- 
eral n : m relationship types between the information types. The introduced 
connector type is then included according to the given order and creates a 
new information node. Due to its size, the formal definition of this derivation 
which comprises several rules is omitted from the paper. 

The definition of derivation rules can be generalized in a straightforward way. Variables
within the rule definitions can be used as parameters for the designer. Thus, the 
designer is enabled to specify the heuristics. Furthermore, rule parameters can 
be adapted to specific application domains and reused in the future. 

4 Open Problems and Future Work 

In the following, we summarize open problems which motivate future work. Fur- 
thermore, we briefly discuss issues of service maintenance, and propose a basis 
appropriate for a system independent, executable representation. 

Refining Semantics To derive executable prototypes, execution semantics of 
scenario specifications should be refined. To enable rapid prototyping, default 
semantics can be applied to derive early prototypes, which can be adapted 
later. We want to illustrate two of these refinement opportunities: 
Constraining classification types Besides the classification type, the
input type used for user interaction needs to be constrained [7]. For
example, concerning categories, the user might choose a discrete set of values;
concerning time, the user might choose an interval.

Context handling Throughout the execution of a scenario, context in 
terms of user inputs and defaults has to be managed. In some scenarios, 
it is not desired that context can only specialize. Therefore, it must be 
permitted to drop parts of the context at specific node transitions. For 
example, when traversing from a chosen hotel to events, the selected 
hotel location might be dropped from the context to not restrict the 
event instances to that location. 

Generating system independent prototypes Currently, there exist many
different systems that integrate database management systems with the web 
as a user interface. To not specialize to one of these systems, we need to spec- 
ify executable information services in a system independent manner. For this 
purpose, we apply finite state machines whose states contain context values. 
On the one hand, this model is close to our logical model because of the 
context handling, and on the other, it represents the navigation structure 
of the web application. This enables a fairly direct generation of executable 
systems from the logical model. Currently, we are implementing the mapping 
from state machines to Sybase web.sql which offers a Perl interface to the 
database. 

Integrating scenarios There are many different opportunities to integrate
scenarios. As a straightforward integration strategy, a virtual root node is
introduced which serves as a gate to the single scenarios. However, in some
cases, it is advantageous to merge scenarios if they possess information nodes
with equal types. This integration process should be supported by the sys- 
tem. Furthermore, parameters as, for example, compactness of the navigation 
structure, might constrain the integration to derive ergonomic services. 

Maintaining services Besides its prototyping character, the approach is pro- 
mising concerning maintenance. The process of deriving and adapting proto- 
types already incorporates issues of maintenance. Since scenarios are defined 
over the schema, changes of the schema are propagated into the scenario
specification. In particular, tedious and error-prone adaptation of database queries
is avoided through the generation of services. Typical maintenance tasks such as
schema changes, extensions of the navigation structure, and changes in the
presentation can thus be realized efficiently. Flexibility in the presentation 
is achieved through exchangeable style guides, that include the presentation 
during the service generation. Currently developed style languages as XSL 
[20] offer the technical basis. 

5 Conclusion 

The approach introduced here supports the designer of information services in
modeling and generating early prototypes. It is based on predefined heuristic
rules that incorporate semantic knowledge from the E/R schema. The following
extensions would make the approach more suitable for practical use.

In some situations, the E/R schema does not provide an appropriate basis
for modeling specific scenarios. In this case, it should be possible to define
views over the schema, which then serve as the basis for the derivation
process. Furthermore, for large E/R schemas, the application of a given
rule system might be too time-consuming. In particular, the matching algorithm
for graph patterns is exponential. However, user scenarios usually involve only
a local part of the types in the E/R schema. Thus, either the designer or
the system may restrict the rule system to parts of the schema.

The opportunity to define derivation rules can also be used to adapt the
rule system to extended E/R models. Thus, semantic concepts such as
aggregation, specialization, generalization, or clusters can be included
to improve the derivation results.

Abstract. The ubiquitous World Wide Web presents a simple interface
for a vast array of heterogeneous information. We see the Web as
one enabler for what we call superimposed information. Superimposed
information serves to highlight, annotate, connect and supplement
information in a base information space. Superimposed information is already
pervasive for the Web, with a variety of models and accompanying
capabilities.

In this paper, we introduce superimposed information and analyze a
range of conceptual models for it using a three-part feature space
consisting of information elements, links, and marks. Information elements
in the superimposed layer and links among those information elements
are analogous to the classical entities and relationships of database
models. The novelty is in the marks that reference underlying information
elements. Superimposed information can serve as proxies for underlying
information elements, can provide new access paths, and can introduce
new information as well as new links among existing information
elements. We conclude with a discussion of open research questions
regarding superimposed information.



1 Introduction 

The Web has a daunting wealth of information, and has engendered a wide range
of aids to help explore and adapt its information: bookmark lists, pages of
related links, web guides. The particular aids that we mention are all examples
of superimposed information: data “placed over” existing information sources 
to organize, access, connect and reuse information elements in those sources. 
Some of it has an individual focus, such as the aforementioned bookmark lists. 
Other forms are collective efforts and meant for wide audiences, for example, 
Web guides such as Yahoo. It is often created and used by someone other than 
the creator of the original underlying source. It goes beyond what has tradition- 
ally been labeled “metadata” in the database community, in that it can refer 
to individual information elements in a source, rather than describing entire 
collections or data sets. Superimposed information has existed for centuries in 
hardcopy forms, such as concordances and commentaries on religious books, law 
and literature. 

While superimposed information is nothing new, we expect its creation and
use will increase markedly in the coming decade, for a variety of reasons. More 
and more kinds of information are being converted to digital form and placed 
online. Moreover, that information is often addressable at a finer granularity 
than its hardcopy analogs: pages and paragraphs rather than entire books and 
articles; frames and scenes rather than whole movies and videos. The huge vol- 
umes of online information demand alternative groupings and organizations of 
information elements to make it usable and comprehensible by individuals and 
special-interest communities. The low cost with which information can be placed 
on the Internet means that much of it is inaccurate or of questionable value, 
creating the need for annotation and evaluations by others. Finally, emerging 
standards such as RDF, XLink and Topic Maps will facilitate the creation and 
exchange of superimposed information. 

Why do we think superimposed information is an important topic for future 
research? At the most basic level, it is an interesting phenomenon with deep 
historical roots that is being profoundly affected by the digital age. More prag- 
matically, we think a better understanding of the connection between the struc- 
ture of superimposed information and the capabilities it supports will have value 
in designing new superimposed information models and accompanying technol- 
ogy. That understanding can also influence the form and function of standards 
for the representation and dissemination of superimposed information. Defining 
the common architectural elements of superimposed information systems can 
be the basis for frameworks and tools for more easily building such systems. 
Such architectures can also help us understand what is required and desired of 
an underlying information space to better support creation and maintenance 
of superimposed information over it. Finally, managing superimposed informa- 
tion presents interesting challenges for traditional data management technology, 
such as handling dynamically discovered data types, combining structured and 
semi-structured information, and bi-level query processing. 

It is also our sense that most superimposed information today is created and 
used via mainly manual processes. For example, bookmarks are added to a list 
one at a time, and are accessed by scrolling through the list. However, in the 
future we hope to see more automated means of constructing superimposed in- 
formation and more powerful ways of using it. Both involve machine processing 
of superimposed information, and we believe that more precise and more infor- 
mative models of superimposed information are needed to do such processing 
effectively. 

2 Limitations of DB Approaches 

Certainly restructuring and reuse of information has long been addressed in the 
domain of databases, for example, with view and data integration mechanisms. 
However, we want to point out that these traditional approaches don’t capture 
all aspects of superimposed information and that they are not necessarily well 
suited for use on the Internet. 

Probably the biggest difference between superimposed information and
database-style views is that superimposed information, in general, may contain data
that is not explicitly present in the underlying sources. While some forms of 
superimposed information, such as that maintained by Web search engines, can 
be derived automatically from the underlying information space, other forms, 
such as Web guides, hold “value-added” knowledge not necessarily contained 
in the base information. View and data integration mechanisms, in contrast, 
typically only contain data that is already present in the underlying sources, 
albeit possibly filtered, regrouped and restructured. 

A second divergence is that database approaches typically require a schema 
for the information sources involved, and often assume a commonality of data 
structuring. For example, much data integration work deals with combining 
sources that are all relational databases. In contrast, on the Web, many in- 
formation sources over which we would like to superimpose information are un- 
structured or semi-structured, with no explicit schema. Further, the forms of 
information are widely diverse. New types of data are showing up online every 
day, which makes the “schema-first” paradigm of DBMSs a liability. A super- 
imposed information system may want to handle new data types as they are 
discovered, without having to anticipate and predefine schemas that are able to 
hold these types. 

A further small point is that database systems have typically provided sup- 
port and services only over data that they directly control. They aren’t set up 
to deal with information “outside the box.” Superimposed information, on the 
other hand, inherently involves connecting with data controlled outside the local 
system [8]. (We note that DBMS features are showing up for managing external
data, such as DB2’s FileLinks.) 

Finally, we note that traditional data integration techniques have been rather 
heavyweight. They typically require a good deal of up-front work — semantic 
analysis, schema integration, and query mapping — on a source-by-source basis. 
Such approaches don’t seem tenable for connecting pieces of information residing 
in hundreds or thousands of different sources. These approaches are also costly. 
They may be affordable for projects with a company-wide or community-wide 
benefit, but are hard to justify for providing capabilities to an individual or small 
group. 

We believe that database systems and DBMS-related integration technologies 
will play a role in the management of superimposed information. However, there 
is clearly a need to explore other parts of the data management system design 
space for more flexible and lighter-weight technologies as well. 

3 Conceptual Architecture for Superimposed Information 

A superimposed information space is an information space with a base layer 
and a superimposed layer. The base layer consists of information sources where 
each information source may be organized according to a typed structure and 
may have various capabilities for manipulating and presenting information. The 
superimposed layer is also organized according to a typed structure (of varying
levels of sophistication).

Fig. 1. The Conceptual Architecture of a Superimposed Information Space



One of the key features of a superimposed information space is its ability to 
reference information in the base layer. Note that we view the superimposed and 
base layers as being conceptually distinct; they may or may not be physically 
distinct. It is also possible that there are multiple levels of superimposed layers. 
We focus here on describing how one layer superimposes over another. 

The conceptual architecture for a superimposed information space is shown in 
Figure 1. Each information source, shown in the bottom half of the superimposed
information space, can be viewed as a preexisting collection of information. An
information source may have a simple structure such as a collection of HTML
pages or a more complex structure, e.g., specified by an XML DTD or a database
schema. Figure 1 shows the structure of the information Source_i as Schema_i. The
information that conforms to the structure is shown as Instance_i in the figure.
Each information source may have a different structure. 

The superimposed layer is shown in the top half of Figure 1. The superim- 
posed layer consists of a schema and an instance, where the schema describes the 
permissible structure of the information in the superimposed layer. The
downward arrows in Figure 1 represent the superimposed information referencing
information in the base layer. 

4 Characteristics of Models for Superimposed Information

There are many aspects of a superimposed information space that are of in- 
terest in our research: the capabilities provided for a superimposed information 
space such as navigation and query, the addressing mechanism used to refer- 
ence information elements in the base layer, the universe of discourse or scope 
of a superimposed information space, the presence of implicit and explicit col- 
lections, etc. We focus in this paper on the modeling constructs that appear 
in the superimposed layer. We do so for two reasons. First, one goal of our re- 
search is to provide generic technology for managing superimposed information. 
To do so, we need a good understanding of the range of features in current 
superimposed information models, and others that might be defined. Thus, in 
this paper we have chosen to analyze HTML, Precis, XML, Structured Maps, 
and RDF because they exhibit a wide range of model features. Second, we want 
to be able to judge between alternative superimposed information models for a 
given application. Such judgments will require understanding such issues as the 
representational capabilities of different superimposed models, which is why as 
part of our analysis of the systems above, we try to tease apart their models into 
basic constituents. 

For the purposes of this paper, we deemphasize the model, if any, used in the 
base layer. A base layer must support information elements that are addressable, 
delimited, and renderable, in some fashion. An information element is the atomic 
unit of information (in the base layer) from the point of view of the superimposed 
layer. As an example, one information element in an XML document corresponds 
to a selection delimited by corresponding start and end tags. An information 
element in an XML document can be addressed using an attribute of type ID or 
by other means such as an XPointer. Note that the nested structure of an XML 
document in the base layer is not of direct concern to the superimposed layer; 
we need to know only the contents of the referenced information element. 

For the superimposed layer, we describe the information elements, links, and
marks, where information elements and links are analogous to entities and
relationships in a traditional entity-relationship model [2]. A mark is a reference to
an information element in the base layer; marks exist in the superimposed layer. 
As an example, in HTML the href attribute value is a mark. We consider the 
HTML page referenced by the href to be marked. 
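
The three constructs can be pictured with a small Python sketch. The class and field names are our own illustrative assumptions and do not come from any of the models analyzed below; an empty `type_name` stands for a model that does not type the construct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """Reference from the superimposed layer to a base-layer information element."""
    address: str                 # e.g. a URL, as in an HTML href


@dataclass(frozen=True)
class InformationElement:
    """New content that lives only in the superimposed layer."""
    name: str
    type_name: str = ""          # empty when the model does not type its elements


@dataclass(frozen=True)
class Link:
    """Connection between two superimposed constructs (elements or marks)."""
    source: object
    target: object
    type_name: str = ""


# Hypothetical example: an annotation element linked to a marked page.
note = InformationElement("reading note", type_name="Annotation")
page = Mark("http://example.org/paper.html")
annotates = Link(note, page, type_name="annotates")
```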

For each model that we analyze, we consider whether the model of the su- 
perimposed layer is distinct from the model(s) used in the information sources 
of the base layer, or is required to be the same. 

For each modeling construct, we consider whether the modeling construct 
can be typed. Types are named, where the name carries the semantics of the 
type within the application. We assume the type of a particular instance of a 
construct to be immutable, in that it does not change with time. It is also possible 
for instances of a particular model construct to be classified, that is, labeled 
with a property of interest. The classification of an instance might be derived 
from its value, or be explicitly assigned. For example, we might classify home 
pages of companies by assigning them “competitor” or “partner” labels. We also
consider whether the instances of a model construct can be explicitly grouped 
into collections. Note that the ability to create named collections of instances 
results in an implicit classification. If “competitors” is a named collection of 
bookmarks, then membership in this collection might be deemed equivalent to 
assigning the “competitor” label. 
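
The relation between explicit classification and named collections can be made concrete with a short sketch. The bookmark URLs and the helper name `implied_labels` are hypothetical.

```python
from collections import defaultdict

# Explicit classification: each bookmark (a mark) is assigned a label.
labels = {
    "http://example.com/acme": "competitor",
    "http://example.com/globex": "partner",
}

# Named collections: bookmarks are grouped, and the collection name acts as a label.
collections = defaultdict(set)
collections["competitors"].add("http://example.com/acme")
collections["partners"].add("http://example.com/globex")


def implied_labels(named_collections):
    """Derive a classification from collection membership (one label per membership)."""
    derived = defaultdict(set)
    for name, members in named_collections.items():
        for member in members:
            derived[member].add(name)
    return dict(derived)
```

Calling `implied_labels(collections)` yields, for each bookmark, the set of collection names it belongs to, which plays the same role as the explicitly assigned labels.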

5 Evaluation of Models for Superimposed Information 

We consider here a range of existing superimposed information spaces to demon- 
strate their diversity. We discuss this evaluation in the next section. 



5.1 HTML over HTML 

HTML (over HTML) provides a very simple model for superimposed informa- 
tion. We view the base layer as a set of possibly interconnected HTML [3] pages 
and the superimposed layer as additional HTML pages. This use of HTML (to 
express a superimposed layer) is not as general as HTML because we do not 
consider references from the base layer to the superimposed layer. 

Table 1 lists the criteria used to evaluate models for superimposed informa- 
tion in the first column and provides an evaluation for “HTML over HTML” 
in the second column. We see that the basic model of the base layer consists of 
HTML pages and hrefs. A mark consists of either a URL or an ID. Elements 
marked via an ID are not delimited unambiguously because HTML tagging is 
more concerned with document format than document structure. For example, 
it is not clear whether a mark on an H3 heading encompasses the following text,
or just the heading content itself. The marks are neither typed nor explicitly
classified.^1 We are not able to connect two marks (e.g., two URLs) together explicitly.
Similarly, we are not able to connect two pages together explicitly.

HTML as a superimposed model is quite weak because there is no support for
typing, classifying, or collecting any of the model elements and because hrefs are
much simpler than relationships found in database models. We include HTML
over HTML in our analysis because HTML is ubiquitous and because it provides 
a compelling example of how superimposed information can be expressed in the 
same model as the base information. 
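
As a small illustration of marks in this setting, the following sketch collects the href values from a superimposed HTML page using Python's standard `html.parser` module. The page content and URLs are invented for the example.

```python
from html.parser import HTMLParser


class MarkCollector(HTMLParser):
    """Collect the href values of anchor tags; each one is a mark in our sense."""

    def __init__(self):
        super().__init__()
        self.marks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.marks.append(value)


# Hypothetical superimposed page that references two base-layer pages.
superimposed_page = """
<html><body>
  <a href="http://example.org/base/page1.html">Page one</a>
  <a href="http://example.org/base/page2.html#sec3">Section 3 of page two</a>
</body></html>
"""

collector = MarkCollector()
collector.feed(superimposed_page)
```

The result is a flat list of addresses: nothing in the page tells us the type of a mark, its classification, or how two marks relate, which matches the weaknesses noted above.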

5.2 Precis over a Medical Record 

Our second example of a superimposed information space is “Precis over a Medical
Record,” shown in the third column of Table 1. This model was developed
while trying to help physicians and other health care providers get an overview 

^1 A new page might contain multiple marks (references to other pages), and the
text of the page may make some statement about how these marks are related, but
the classification is not explicit in the model of pages and links.

Criterion                                         HTML over HTML      Precis over Medical Record
------------------------------------------------  ------------------  --------------------------
Characteristics of the Base Layer
  What model is used in the base layer?           pages/urls          documents
  What can be marked?                             pages/IDs           documents
  Is the marked information delimited?                                yes (trivial)

Characteristics of the Superimposed Layer
  What model is superimposed?                     pages/urls          precis, in bundles
  Is the model of the superimposed layer
  distinct from the model(s) of the base layer?   not distinct        distinct

Marks
  What sorts of marks are there?                  urls/IDREFs         document id
  Is everything marked?                           no                  yes
  Where do marks reside?                          in new pages        as precis
  Are marks typed?                                no                  no
  Can marks be classified?                        no                  yes (bundle name)
  Can marks be collected?                         no                  yes, in bundles

Information Elements
  Can new information elements be added?          yes (pages)         there are no information
                                                                      elements other than marks
                                                                      (precis)
  Are new information elements typed?             no
  Can new information elements be classified?     no
  Can new information elements be collected?      no
  Are there new attributes for information
  elements?                                       yes (of elements)
  What are the attribute values?                  strings

Links
  Are there links from mark to mark?              no                  no
  Are there links from mark to information
  elements (other than marks)?                    no                  no
  Are there links from information elements
  to information elements?                        no                  no
  Can links be typed?                             no                  no
  Can links be classified?                        no                  no
  Can links be collected?                         no                  no

Table 1. Evaluation of HTML/HTML and Precis/Medical Record



of a patient’s medical record [9]. We refer to this process of getting the gist of a 
patient’s medical history as the familiarization problem. Based on managed care 
and the mobility of both patients and health care providers, it is quite often the 
case that we see individual health care providers for the first time. Each time, 
the health care provider must become familiar with our medical history before 
making a diagnosis or treatment recommendation. 

Our approach to solving the familiarization problem is to introduce one
precis^2 for each document in a patient’s medical record. Precis include a small
number of selected attributes from the corresponding document such as date, 
document type, and author. Precis are in one-to-one correspondence with the 
documents in the underlying medical record. The basic idea is that precis can 
be easily manipulated, visualized, and summarized. [Note that we describe pre- 
cis as if they were superimposed over a paper medical record, but the concepts 
and technology apply for an electronic medical record as well.] The health care 
provider might benefit from seeing precis displayed graphically, ordered by date, 
with the color of a precis icon indicating the author’s specialty (e.g., cardiac or 
pulmonary). Such a view might allow the health care provider to quickly “see” 
the major medical events in a patient’s life. 

In our follow-on project [11], we extend our use of precis to capture the 
“footprints” of a health care provider as he or she works with a medical record. 
The idea is to allow for bundles of precis to be created both passively (as the 
health care provider peruses the medical record while diagnosing and treating 
the patient) and actively (where the health care provider actively selects a subset 
of records as relevant for a particular purpose). Any number of bundles can be 
created: a particular precis may be present in zero or more bundles. The goal of 
the project is to reuse the knowledge that is implicit in bundle composition. The 
same health care provider or a collaborative health care provider may benefit 
from knowing the context, in the form of the active bundle(s), for a given health 
care provider’s previous analyses. 

This model is unique among models examined here in that the superimposed 
layer is complete over the base layer; we construct one precis for each document in 
the base layer. The marks, i.e., the precis, are not typed. The model is also unique 
in its support for bundles of precis to support collection and thus classification. 
Note that there is only one type of link: the (trivial) link from a precis to its 
corresponding document. In other words, the only link supported is in the form 
of a mark. 
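
A minimal sketch of the precis model follows. The field names mirror the attributes mentioned above (date, document type, author), while the record contents, bundle names, and helper functions are hypothetical and are not taken from the actual system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Precis:
    """A mark on exactly one medical-record document, plus a few selected attributes."""
    document_id: str
    doc_date: date
    doc_type: str
    author: str


def build_precis(documents):
    """One precis per document, so the superimposed layer is complete over the base."""
    return {d["id"]: Precis(d["id"], d["date"], d["type"], d["author"]) for d in documents}


# Hypothetical record with two documents.
record = [
    {"id": "doc-1", "date": date(1998, 3, 2), "type": "lab report", "author": "Dr. A"},
    {"id": "doc-2", "date": date(1999, 1, 15), "type": "consult note", "author": "Dr. B"},
]
precis_by_doc = build_precis(record)

# Bundles are named collections of precis; a precis may appear in zero or more of them.
bundles = {
    "cardiac work-up": {precis_by_doc["doc-1"], precis_by_doc["doc-2"]},
    "recent visits": {precis_by_doc["doc-2"]},
}


def bundles_containing(precis, all_bundles):
    """Bundle names act as classifications of the precis they contain."""
    return sorted(name for name, members in all_bundles.items() if precis in members)
```

Because each precis is itself the mark on its document, no separate link construct is needed in this sketch.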



5.3 XML-based Superimposed Information Spaces 

In this section, we consider “XML over XML” where new (superimposed) XML [4] 
documents can reference existing information elements in other XML documents. 
We also mention how XPointer and XLink extend the XML model. See Table 2. 

XML, as well as SGML, supports nested, richly structured information ele- 
ments in the superimposed (as well as the base) layer. XML supports typing for 
all elements, through tag names, but does not support classification or collection 
of model elements beyond tagging and nesting of elements. 

XML support for links is weak. The ID/IDREF capability provides a unidi- 
rectional link from an attribute to one or more marks. 

^2 A precis, according to Webster’s New Twentieth Century Dictionary, unabridged,
1979, is a “concise abridgment” or “summary.”

Criterion                                         XML over XML
------------------------------------------------  ------------------------------------------
Characteristics of the Base Layer
  What model is used in the base layer?           XML documents
  What can be marked?                             element(s)
  Is the marked information delimited?            yes

Characteristics of the Superimposed Layer
  What model is superimposed?                     XML
  Is the model of the superimposed layer
  distinct from the model(s) of the base layer?   not distinct

Marks
  What sorts of marks are there?                  URIs/IDREF (XPointers)
  Is everything marked?                           no
  Where do marks reside?                          in new XML documents (as attribute values)
  Are marks typed?                                yes
  Can marks be classified?                        no (by role in XLink)
  Can marks be collected?                         no (except by XPointer or multiple XLinks)

Information Elements
  Can new information elements be added?          yes (XML)
  Are new information elements typed?             yes
  Can new information elements be classified?     no
  Can new information elements be collected?      no
  Are there new attributes for information
  elements?                                       yes (of elements)
  What are the attribute values?                  strings

Links
  Are there links from mark to mark?              no (except XLinks)
  Are there links from mark to information
  elements?                                       no
  Are there links from information elements
  to information elements?                        yes (IDREF) (except XPointer)
  Can links be typed?                             yes (attribute or XLink name)
  Can links be classified?                        no
  Can links be collected?                         no

Table 2. Evaluation of XML/XML Superimposed Information Spaces



XPointer [6] provides a more sophisticated way to specify marks using an 
expression. An XPointer expression can traverse the nested structure of an 
XML document including traversal to ancestors, descendants, and siblings. An 
XPointer expression can also select information elements that meet certain cri- 
teria. XPointers are typed by the attribute name where they appear but are not 
explicitly classified or collected. 

An XLink [5] allows two or more marks to be connected directly. The XLink 
itself is typed and each participant reference (i.e., mark) in an XLink is associated 
with a role. As an example, if I create a “business relationship” XLink to con- 
nect suppliers with vendors, I might use role names of “supplier” and “vendor.” 
XLinks can also be used to connect new (superimposed) information elements 
to marks and to each other. XLinks are not explicitly classified or collected. 

Note that the type of the referenced (i.e., marked) elements is not constrained
with ID/IDREF, XPointer, or XLink. Also, the usual constraints on minimum and
maximum cardinality, for each role in an XLink, are not expressible.
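
The basic ID-style marking can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`catalog`, `painting`, `comment`, `about`) are invented for the example, and the sketch treats `id` and `about` as ID- and IDREF-like attributes purely by convention, since ElementTree does not read DTD attribute declarations. XPointer and XLink processing are not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical base document whose elements carry id attributes.
base_doc = ET.fromstring(
    '<catalog>'
    '<painting id="p1"><title>Water Lilies</title></painting>'
    '<painting id="p2"><title>Sunflowers</title></painting>'
    '</catalog>'
)

# Hypothetical superimposed document; the `about` attribute holds the mark.
superimposed_doc = ET.fromstring(
    '<notes>'
    '<comment about="p2">Compare the brushwork.</comment>'
    '</notes>'
)

# Index base elements by their id attribute (treated here as the ID attribute).
by_id = {el.get("id"): el for el in base_doc.iter() if el.get("id") is not None}


def resolve(comment):
    """Follow the mark stored in the `about` attribute to the marked base element."""
    return by_id.get(comment.get("about"))


marked = resolve(superimposed_doc.find("comment"))
```

Note that nothing here restricts `about` to point at a `painting` element, which reflects the observation that the type of the marked element is not constrained.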

5.4 Structured Maps 




[Figure 2: diagram not reproduced. Recoverable labels: Entity Type, Facet Type,
Entity Instance, Facet Instance, Relationship Instance, "... that is marked".]

Fig. 2. Example Structured Map



The most complex model for superimposed information spaces that we
investigate is the Structured Map [10], inspired by the Topic Map [1]. Topic Maps
have been developed in the SGML community and are being standardized by
ISO to provide an integrated table of contents, glossary, and index for one or
more documents or information sources. Structured Maps/Topic Maps introduce
an entity-relationship model in the superimposed layer. The main difference
between Structured Maps and Topic Maps is that, at least originally, Structured
Maps were more explicit and rigid about the schema of the superimposed
information. Figure 2 illustrates a Structured Map to track artists and paintings.

Each entity type may have one or more facets,^3 intended to hold references to
information elements in the base layer. That is, facets are intended to hold marks. 

^3 We use the terminology defined for Structured Maps (e.g., entity, relationship, and
facet) rather than the terms used for Topic Maps (e.g., respectively, topic, associa- 
tion, and occurrence role). Note in particular that the term facet denotes different 
concepts in the two models. 

In Figure 2, we see two entity types and one relationship type for Artist, Painting
and Painted-by, respectively. The Artist entity type has a facet to reference
information elements in the base layer that contain biographical information
about the artist. Painting, on the other hand, has two facets: one for photographs
of the painting and one for reviews of the painting. Thus each entity is elaborated
through typed collections of marks on the facets. The Painted-by relationship
serves to connect Artist instances with Painting instances.

Structured maps are evaluated in the second column of Table 3. The superim- 
posed model is obviously distinct from the model used for the base layer. There 
is only one kind of mark and it can appear only on facets. Structured Maps 
embody the essence of superimposed information in that they exist to provide 
new access paths to base information. The model for the superimposed layer is 
closest to a database- or ER-style model; note that there is no direct support 
for classification or collection of model elements. (However, our initial prototype 
for Structured Maps did collect entity and relationship instances.) Note also 
that marks cannot exist on their own; they appear on facets for an entity. This 
situation is in sharp contrast to the Precis model; a precis, the main modeling 
construct, is a mark. Another way to view this contrast is to imagine an entity in 
a Structured Map as being forced to have one facet, with a single reference, that 
indicates the underlying information element that this entity represents. This 
limited form of entity in a Structured Map corresponds to a Precis. In general, 
an entity in a Structured Map need not correspond to an underlying information 
element at all. 
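
The artist and painting example can be sketched as a small data structure. The entity types, relationship type, and facets follow the description of Figure 2; the class names, instance names, and URIs are assumptions for illustration and do not reflect the Structured Map prototype.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Entity instance; each facet name maps to a list of marks (URIs)."""
    entity_type: str
    name: str
    facets: dict = field(default_factory=dict)

    def add_mark(self, facet, uri):
        self.facets.setdefault(facet, []).append(uri)


@dataclass
class Relationship:
    """Relationship instance connecting two entity instances."""
    rel_type: str
    source: Entity
    target: Entity


# Schema: which facets each entity type may carry.
FACET_TYPES = {"Artist": {"biography"}, "Painting": {"photograph", "review"}}

monet = Entity("Artist", "Claude Monet")
lilies = Entity("Painting", "Water Lilies")
monet.add_mark("biography", "http://example.org/bios/monet.html")
lilies.add_mark("photograph", "http://example.org/img/lilies.jpg")
lilies.add_mark("review", "http://example.org/reviews/lilies.html")
painted_by = Relationship("Painted-by", lilies, monet)


def conforms(entity):
    """Check that an entity only uses facets declared for its type."""
    return set(entity.facets) <= FACET_TYPES[entity.entity_type]
```

Marks appear only inside `facets`, never on their own, and an entity with no facets filled in is still a valid entity, in line with the contrast to the Precis model drawn above.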



5.5 Resource Description Framework (RDF) 

The Resource Description Framework [7], RDF, has been proposed as a mecha- 
nism to add metadata to Web resources. The goal is to automate the processing 
of Web information, e.g., for resource discovery, for content rating, for describing 
collections of pages as a single resource, for expressing privacy preferences, etc.

RDF uses a model of resources, named properties, and statements represented 
as a (property name, resource, property value) triple. One type of resource rep- 
resents a Web page, referred to as a referenced resource in Figure 3. It is also 
possible to create new resources that have no correspondence with existing Web 
resources, shown as a “new resource” in Figure 3. Resources are described using 
properties. Each property has a value; the value for a property may be a literal 
or another resource as shown in Figure 3. 

The third column of Table 3 shows our evaluation of RDF. A resource (in the
case of a resource that references a URI) serves as a mark. Properties (themselves)
are represented as resources. Property types are also represented as resources.
And aggregation constructs such as set and bag are represented as resources.
A given resource is declared to be of a certain type by being connected to the
^4 Types, collections, and classifications are supported through a property that
connects resources to a resource that represents a type, a collection, or a
classification, respectively.




Criterion                                         Structured Map           Resource Description Framework
------------------------------------------------  -----------------------  ------------------------------
Characteristics of the Base Layer
  What model is used in the base layer?           Open                     anything with URI
  What can be marked?                             anything with URI        anything with URI
  Is the marked information delimited?            depends on base layer    depends on base layer

Characteristics of the Superimposed Layer
  What model is superimposed?                     entities, with facets,   resources with property
                                                  and relationships        value pairs
  Is the model of the superimposed layer
  distinct from the model(s) of the base layer?   distinct                 distinct

Marks
  What sorts of marks are there?                  URIs                     URIs
  Is everything marked? (Is the superimposed
  layer complete over the base layer?)            no                       no
  Where do marks reside?                          on facets of entities    as resources
  Are marks typed?                                yes
  Can marks be classified?                        no                       yes^4
  Can marks be collected?                         yes (implicitly)         yes^4

Information Elements
  Can new information elements be added?          yes (entities)           yes (resources that are
                                                                           not marks)
  Are new information elements typed?             yes                      yes
  Can new information elements be classified?     no                       yes^4
  Can new information elements be collected?      yes (implicitly)         yes^4
  Are there new attributes for information
  elements?                                       yes (of entities)        yes
  What are the attribute values?                  any domain               literal

Links
  Are there links from mark to mark?              indirectly by marks on   yes
                                                  facet(s) for one entity
  Are there links from mark to information
  elements (other than marks)?                    no                       yes
  Are there links from information elements
  to information elements?                        yes (relationships)      yes
  Can links be typed?                             yes                      yes^4
  Can links be classified?                        no                       yes^4
  Can links be collected?                         yes (implicitly)         yes^4

Table 3. Evaluation of Structured Map and RDF models
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Fig. 3. Model for the Resource Description Framework 



appropriate “type resource” by the appropriate property. Similarly, a resource 
can be declared to be a member of a certain collection by being connected to 
the appropriate “aggregate resource” by the appropriate property. The model 
consists only of RDF descriptions; that is, (property name, resource, property 
value) triples. But the intent of the technology is to provide the ability to search 
for resources according to their properties and property values. Thus there are 
implicit collections of resources and properties. 

Since the value of a property can be a resource, properties can connect marks 
to marks (when they connect two resources that reference URIs), marks to (new) 
information elements (when they connect a resource that references a URI with 
a new resource), and (new) information elements to (new) information elements 
(when they connect two new resources). Since properties are resources, they can 
be typed, classified, and collected as with marks and (new) information elements. 
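
As a rough illustration of the triple model described above, the following Python sketch stores descriptions as (property, resource, value) triples and classifies each triple by the kinds of things it connects. The names `Resource`, `Triple`, `link_kind`, `resources_with`, and the example URIs are hypothetical and chosen only for this sketch; they are not part of RDF.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Resource:
    """A node in the superimposed layer; it acts as a mark when it references a URI."""
    name: str
    uri: Optional[str] = None

    @property
    def is_mark(self) -> bool:
        return self.uri is not None


@dataclass(frozen=True)
class Triple:
    """One description: (property name, resource, property value)."""
    prop: str
    subject: Resource
    value: Union[Resource, str]  # another resource, or a literal


def link_kind(t: Triple) -> str:
    """Classify a triple by what it connects."""
    if not isinstance(t.value, Resource):
        return "attribute"  # a literal value is an attribute, not a link
    ends = (t.subject.is_mark, t.value.is_mark)
    if ends == (True, True):
        return "mark-to-mark"
    if ends == (False, False):
        return "element-to-element"
    return "mark-to-element"


def resources_with(triples, prop, value):
    """Implicit collection: every resource that has the given property value."""
    return {t.subject for t in triples if t.prop == prop and t.value == value}


# Hypothetical example data.
page = Resource("page", uri="http://example.org/page")
paper = Resource("paper", uri="http://example.org/paper")
author = Resource("author")          # new information element (no URI)
person_type = Resource("PersonType")  # a "type resource"

triples = [
    Triple("cites", page, paper),
    Triple("creator", page, author),
    Triple("type", author, person_type),
    Triple("title", page, "Home"),
]

kinds = [link_kind(t) for t in triples]
typed_as_person = resources_with(triples, "type", person_type)
```

Here typing is itself just a triple whose value is a "type resource", and the collection of all persons is never stored; it is recovered by searching on the property, which is the sense in which collections are implicit.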

6 Results from the Evaluation 

First we comment on the intended purpose of the superimposed models described 
above. We note that in HTML and XML the superimposed information is in- 
tended to supply new content (e.g., a new page or document or table) that may 
reference existing content. RDF, on the other hand, is intended to add descriptive
metadata to existing Web resources. RDF is not expressed in the model of
the base layer (e.g., HTML or XML) in order to promote faster search of the 
metadata by search engines. (However, standard encodings of RDF into XML 
are defined.) Structured Maps are intended to provide a semantically based index 
for a set of base documents. Note that the entities and relationships introduced 
in the superimposed layer need not be present in the base layer. More than that,
Structured Maps explicitly allow (new) entities to be elaborated in various ways
by sets of marks (that is, by sets of references appearing on distinct facets). 
Finally, Precis and RDF also provide new access paths to existing information. 

Next we consider the details of the evaluation and point out some of the dis- 
tinguishing features of the various superimposed models. Among the models used 
in the base layer, only HTML does not fully delimit its information elements.
We note that Precis, Structured Maps, and RDF are distinct from the model
of the base layer. Most of the superimposed models allow marks to be typed; 
HTML and Precis are the exceptions. On the other hand, only Precis and RDF 
allow the model constructs to be explicitly classified and collected. But notice 
that with Precis, the classification follows from the collection (in the form of a 
named bundle of precis), while in RDF the collections follow from the classifica- 
tion (based on properties). We note that every superimposed model considered 
here supports attributes for new information elements. Finally, we note that only 
XLink and RDF support direct connection from mark to mark. Other models 
connect marks only indirectly, by having two or more marks appear within the 
same content. 



Each column ranks the models from highest (top) to lowest (bottom).

individual information elements | marks            | links (not including marks) | collections
SGML, XML                       | XPointer         | XLink                       | Precis
Structured Map                  | Structured Map   | Structured Map, RDF         | RDF
RDF                             | Precis, RDF      | HTML, SGML, XML             | Structured Map, XLink
HTML, Precis                    | XML, HTML, SGML  | Precis                      | HTML, SGML, XML



Fig. 4. Summary of Model Evaluation 



Another way to summarize our evaluation is shown in Figure 4. The first 
column of Figure 4 ranks the various superimposed models according to the 
complexity of the information elements in the superimposed layer. Nested doc- 
ument models are the most complex; database-style superimposed models are 
next; then RDF and finally HTML and Precis, where information elements are 
untyped. From the point of view of marks, shown in the second column of Fig- 
ure 4, XPointer is the most complex; database-style models are next; then Precis 
and RDF (where the marks are simple precis or resources in one-to-one corre- 
spondence with base information elements) and finally XML, HTML, and SGML 
(with URIs appearing as attribute values for other content). With regard to links 
(not including marks), shown in the third column of Figure 4, XLink is the most 
complex because independent, typed, n-ary links can be defined. Database-style 
links and RDF are next most complex because they provide relationships. Doc- 
uments are quite simple; relationships are mainly inherent in nested document 
structure. Precis have no relationships. Finally, support for explicit collections
is present in Precis, implied in RDF, indirectly present in Structured Maps and
XLink, and not present in HTML, XML, and SGML.

We can draw several conclusions from this analysis. (1) The complexity of 
the superimposed model varies. (2) Complexity of one modeling construct does 
not necessarily correlate with complexity in another. In fact, we see models 
that omit certain constructs in their entirety. (3) Each of these models could be 
quite interesting to study in its own right, particularly a model that isolates a
particular modeling feature, e.g., the way that precis (simply) model marks
that can be collected.

One might be tempted to develop an expressive model that encompasses all 
structural features for the diverse information we see online. We believe that 
superimposed information spaces provide a useful alternative to a single, global, 
all-inclusive model. What we really want is superimposed technology (e.g., tools 
for RDF, or XML, or Structured Maps) that exploits the various, often simple, 
models of superimposed information. 



7 Open Research Issues 

We think we have only begun to scratch the surface in our investigations of 
superimposed information. At this point we are producing questions at a faster 
rate than answers. We list some of them here. 

1. Are there other forms of superimposed information in use that are not de- 
scribed well by the notions of information elements, marks and links as 
presented here? How do collections and classification fit into the picture? 

2. What is the connection between features of a superimposed information 
model and the capabilities it supports? For example, what model features 
are necessary for query? For navigation? For classification-based search? 

3. How does the form of superimposed information affect the effort required 
for its construction and maintenance? Are there some forms that are easier 
to elicit from users or produce using tools? Are some forms more robust 
in the face of updates in the underlying information space? What forms of 
superimposed information map easily into current information management 
tools, such as relational and object-oriented databases or XML managers? 

4. What are the challenges when the superimposed information has a sub- 
stantially different model from the underlying information sources? We are 
particularly interested in highly structured superimposed information (e.g., 
relational tables) over lightly structured base data (e.g., Web pages) and vice 
versa. 

5. With a bi-level information structure (base and superimposed information),
one might want tools that deal simultaneously with both levels. One example 
would be a browser from which you can hop up from the base layer to the 
superimposed layer, traverse connections in the superimposed information, 
then drop back down into the base layer. Another is a query processor that 
allows queries that span superimposed and base layers. The latter poses some
interesting linguistic and semantic challenges when the models at the two
levels are different.

6. How important is it for the superimposed layer to explicitly identify or de- 
limit its universe of discourse in the base layer? That is, should the superim- 
posed layer define the range of base information elements that it refers to, 
or may potentially refer to? (Note that it is useful to know that some range 
of base information was considered but that no relevant elements were found.) If
the universe of discourse is described, what are suitable notations for it? 

7. We suspect that fusing collections of superimposed information might be sub- 
stantially easier than full integration of the underlying information sources, 
especially if the superimposed model is simple. However, this suspicion is 
just an intuition at this point and needs further investigation. 

8. Do the same models and techniques for superimposed data handle superim- 
posed metadata? We are particularly interested in the notion of a “schema- 
later” database, where commonalities in structure or semantics are observed 
after some amount of data has been collected, and a schema is then “retrofitted”
to the data. 

9. What are the major variations in the superimposed information architec- 
ture we have sketched? One interesting case we have identified is when su- 
perimposed and base levels share a common model (both relational, both 
XML, etc.). With a common model, the levels could be segregated (stored 
and controlled as separate spaces) or commingled (managed in the same 
space). An obvious extension to our architecture is “super-superimposed in- 
formation” : superimposed information over other superimposed information, 
which might be an approach to fusing collections as mentioned in item 7.

10. How do addressing modes and related capabilities (address comparison, 
value-to-address mapping) in the base space affect the ability to structure 
and process superimposed information? Or turning the question around, how 
can base information be structured and managed to better support super- 
imposed information? 

11. Some information spaces currently have quite rich addressing mechanisms,
such as URLs and XPointers for the Web. Modes of external addressing
for other kinds of information aren’t as well developed. For example,
what are reasonable ways to identify information elements in relational or 
object-oriented databases? Is it possible to support addressing in virtual 
data, such as Web pages generated on the fly from data stored in a different 
form?

Formalizing the Specification of Web Applications

Abstract. As the size of Web applications grows, it becomes clear that
we need better tools to deal with their growing complexity. The current
trend has been to assist the developer during the implementation
stage, with little or no emphasis on the design process. Formal specification
languages allow the unambiguous description of the properties of
a system without restricting its implementation. Formal languages can
be used to verify properties about the design. We present in this paper
Flash, a formal specification language for hypertext design. Based on set
theory, Flash is a formal system that attempts to separate the different
tasks faced during the design process. A Flash specification first formal- 
izes the content of the application and its relationships. Then it collates 
that content into navigational composites. Finally, it specifies how those 
composites can be navigated. Each stage is clearly specified with precise, 
unambiguous syntax and semantics. Furthermore, Flash verifies proper- 
ties such as completeness and type consistency of the specification. 



1 Introduction 

The World-Wide Web is an eclectic collection of hypermedia applications. Some 
of these applications are very simple, while others are complex information systems
comprising a large number of pages. Over time, webmasters have
learned that the development of these large Web sites is not a simple task.
The term Web Engineering promotes the concept that building Web sites is 
an engineering process, similar to software engineering; hence, Web engineering 
is concerned with the development of Web sites the same way that software 
engineering is concerned with the development of software. 

In order to tackle the ever-growing complexity of hypermedia applications
(including Web sites), several methodologies for hypermedia development have
been suggested, such as Schwabe and Rossi’s Object-Oriented Hypermedia Design
Model (OOHDM) [10], Isakowitz et al.’s Relational Management Methodology
(RMM) [5], Lange’s Object Oriented Design Method (EORM) [6], and Garzotto
et al.’s Hypermedia Design Model (HDM) [4]. These methodologies are, primarily,
guidelines to be followed during the design process. They also specify
the characteristics of the deliverables that are created at each of their stages.
These products are usually informally specified (in the sense that they have
neither a formal syntax nor formally defined semantics) and they are not required
to pass validity tests.

It is important that such methodologies use well-defined formal descriptions 
for their deliverables in order to facilitate the creation of support tools and 
the verification of properties of the design (such as whether the specification 
is syntactically and semantically correct, is complete or is type consistent, for 
example). Formal methods have been used in the past to specify the time constraints
of dynamic hypermedia applications [1, 9, 8], and they have been used
to specify particular characteristics of the hypermedia platform [2]. They have not
been used to specify the deliverables of the hypermedia methodologies for discrete
hypermedia applications (discrete hypermedia applications are those which do
not have dynamic or time-based components).

Formal specifications use a mathematical notation to describe, precisely and 
unambiguously, the properties which an information system must have without 
unduly constraining the way in which those properties are achieved [12]. Ques- 
tions regarding the characteristics of the final application can be answered with 
confidence without resorting to analyzing the final application; furthermore, for- 
mal specifications are unambiguous, contrary to diagrammatic or textual speci- 
fications. 

As Jeannette Wing states, a formal specification language “provides the means of
precisely defining notions like consistency and completeness, and more relevantly, 
specification, implementation and correctness. It provides the means of proving 
that a specification is realizable, proving that a system has been implemented 
correctly, and proving properties of a system without necessarily running it to 
determine its behavior.” [13]. 

A formal specification language for hypermedia development would allow 
a developer to strictly specify a hypermedia application and to verify certain 
properties about it. Such a language will help the designer find errors in
the design that could otherwise be found only during the implementation or
testing of the application. Maintenance is also improved: the ability to verify the
specification will allow the maintainer to assess the impact of changes in the
design before actually implementing them.

2 Flash, a formal specification language for Web 
applications 

Flash, the Formal Language for the Specification of Hypermedia, is a formal specification
language for hypermedia design. Flash is object-oriented and attempts
to be rich enough to specify the design of discrete hypermedia applications. Flash
uses higher-order logic and is based on PVS [7] and Z [12].

Flash uses the data and process model of OOHDM. OOHDM is a model- 
based approach for building large hypermedia applications. It is composed of 
four activities: conceptual design, navigational design, abstract interface design,
and implementation. The conceptual design is concerned with the description
of the classes and relationships relevant to the target domain of the hypermedia
application. The conceptual design is specified using an entity-relationship
diagram. The navigational design specifies a view over the conceptual schema.
The navigational design is described with two schemas: the navigational class
schema (or NS) and the navigational context schema (or NCS). The NS describes 
navigational classes, which are of two types: composites and links. The compos- 
ite objects are views over the conceptual design. They can be composed into 
complex entities which can be presented to the reader as one node (or page) in 
the final application. Link classes are equivalent to relationships in the concep- 
tual design. They provide the means to navigate the different composites. The 
NCS is concerned with the creation of navigational contexts. A context is a set of
navigational objects which creates a “space” in which a reader can “move”. Contexts
are used to create access structures (such as menus) to the set of objects
they contain. A context can be defined by enumeration (for example, “include
in this context the navigational objects corresponding to John and Mary”) or
by a predicate (“include the news items that are less than one day old”).
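
The two ways of defining a context can be sketched in a few lines of Python. This is only an illustration under our own assumptions: `NewsItem`, `context_by_enumeration`, `context_by_predicate`, and the sample data are hypothetical names, not OOHDM or Flash constructs.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class NewsItem:
    title: str
    published: date


def context_by_enumeration(objects: Iterable[NewsItem], titles: List[str]) -> List[NewsItem]:
    """Keep exactly the objects that are named explicitly."""
    wanted = set(titles)
    return [o for o in objects if o.title in wanted]


def context_by_predicate(objects: Iterable[NewsItem],
                         predicate: Callable[[NewsItem], bool]) -> List[NewsItem]:
    """Keep every object that satisfies the predicate."""
    return [o for o in objects if predicate(o)]


today = date(1999, 11, 15)  # fixed date so the example is reproducible
items = [
    NewsItem("Election results", today),
    NewsItem("Local fair", today - timedelta(days=3)),
    NewsItem("Cup final", today - timedelta(days=1)),
]

picked = context_by_enumeration(items, ["Local fair", "Cup final"])
recent = context_by_predicate(items, lambda n: today - n.published < timedelta(days=1))
```

An enumerated context is fixed once written down, whereas a predicate-based context changes membership as the underlying objects change.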

2.1 Designing a Magazine Web Site




Fig. 1. OOHDM Conceptual Schema of the Magazine Web Site 



In order to clarify our description, we present the design of the Web site of a
magazine (based on the example in [11]). This Web site presents Stories (which
can be sub-classified into Essays, Translations and Interviews). There are two
more classes: Persons (which can be the authors or the interviewees) and Q&A.
Figure 1 shows a conceptual schema. For the sake of simplicity, the number of
classes and their attributes and methods has been minimized.

Fig. 2. OOHDM Navigational Class Schema of the Magazine Web Site



Figure 2 shows the NS. It shows how the Story class has been redefined 
to include its author (obtained by using the relationship Is Author Of). The 
Interview class has been enhanced to include the name of the interviewee. 




Fig. 3. OOHDM Navigational Context Schema of the Magazine Web Site 



Figure 3 shows the NCS. It describes how the Web site can be traversed
by Section (defined by the attribute section of the Story class), by Author
(instances of the Person class which are related to Stories by the relationship
Is Author Of), or by a given list of highlights. Each of these contexts links the
context indices (blocks in dashed lines) to the navigational class Story.

3 Formal specification of the Magazine Web Site 

A Flash specification starts with the declaration of the types used by the application.
In Flash, there are two kinds of types: atomic and collection. Atomic
types are user defined, predefined (integer, real, boolean), or enumerated (the
type defines a finite sequence of values the instance can take). Types which are
not atomic must be collections. Collections gather instances of the same type.
Flash includes three collection types: set, bag and sequence. As their names imply,
sets are unordered collections of non-repeated elements; bags are unordered
but can contain repeated elements; finally, sequences are ordered lists of
elements. Because Flash is not concerned with the implementation, some types
can be defined as “known”, i.e. these are atomic types whose internals are
not relevant to the specification, but they are used to guarantee type consistency.
Figure 4 shows the type declarations for the magazine site.



[date, audio, photo, emailaddress, url];
section_type : {international, sports, local, national};
photos : SET OF photo;

Fig. 4. Type declarations in Flash 
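
For readers more familiar with programming languages than with specification notations, the type categories above have rough Python analogues. The mapping below is our own assumption for illustration (opaque types as `NewType`, enumerations as `Enum`, and the three collections as `frozenset`, `Counter`, and `tuple`); it is not how Flash itself is defined.

```python
from collections import Counter
from enum import Enum
from typing import FrozenSet, NewType

# Opaque atomic types: only the name matters, the internals are left unspecified.
Date = NewType("Date", str)
Photo = NewType("Photo", str)


class SectionType(Enum):
    """Enumerated atomic type, mirroring section_type in Figure 4."""
    INTERNATIONAL = "international"
    SPORTS = "sports"
    LOCAL = "local"
    NATIONAL = "national"


# Collection type alias, mirroring "photos : SET OF photo".
Photos = FrozenSet[Photo]

names = ["ann", "bob", "ann"]
as_set = frozenset(names)    # set: unordered, no repeated elements
as_bag = Counter(names)      # bag: unordered, repetitions are counted
as_sequence = tuple(names)   # sequence: ordered, repetitions kept
```

The same three input values give three different collections, which is the distinction the text draws between set, bag, and sequence.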



Flash describes the conceptual schema textually. Figure 5 shows the Flash
schema for the magazine application. The conceptual schema describes the types,
classes and relationships that compose the data of an application. A class definition
contains a list of type signatures of its component attributes and methods.
Because classes are not modifiable by the hypermedia application, Flash makes
no distinction between attributes and methods.



CLASS Story IS
    Title : string;
    SubTitle : string;
    Date : date;
    Summary : string;
    Text : string;
    Section : section_type;
END;

CLASS Q_and_A IS
    Question : string;
    Answer : string;
END;

CLASS Interview
    INHERITS FROM Story IS
    Sound : audio;
    Illustration : photos;
    Questions : SEQUENCE OF Q_and_A;
END;

CLASS Person IS
    Name : string;
    Bio : string;
    Address : string;
    Telephone : string;
    HomePage : url;
    Email : string;
    Photo : photo;
END;



Fig. 5. Flash Specification of the Conceptual Design of the Magazine Web Site



Relationships are fundamental for a hypermedia application. They are used
to combine content into higher level composites and to be the base of hyperlinks.
Relationship definitions include their name, their type signature and any predicates
that restrict the relationship. In the predicate section (see Figure 6), the
resulting predicate is the conjunction of its statements (each separated by a semicolon).
Finally, the ASSUME keyword is used to predefine the type of variables which
are later bound in the predicates. Notice that the first relationship (Is_Author)
specifies that stories can be written by one or more authors. Notice also that
every story should have an author. The second relationship (Related_To) does
not have any constraints. Finally, Grants relates persons to interviews and guarantees
that for every interview there is a person “granting” that interview.

The navigational classes schema is a collection of composite definitions. Com- 
posite definitions are classes which combine content classes or other composites. 
They usually are parametrized. A composite declaration is composed of five 
parts: a unique identifier, a parameters list, an inheritance definition, an at- 
tributes section and a predicates section. The parameter list specifies the type 
signature of the objects or composites needed to instantiate the class. Usually, 
composites don’t define data; instead, they combine both objects (from the conceptual
design) and other composites. The inheritance definition names the
class from which the current class inherits; it may be empty. The attributes
section defines further attributes required by the current class. Usually,
the value of these attributes depends on the value of the parameters in combination
with relationships. Finally, the predicate section specifies a sequence of
invariants that the instance of that class should satisfy. At any given state of the
object, the conjunction of the predicates should be true.



RELATIONSHIP Is_Author
    FROM Person TO SET OF Story
WHERE
    ASSUME P : Person
    ASSUME SS : SET OF Story
    ASSUME S : Story
    FOREACH S:
        EXISTS SS, P:
            S IN SS AND
            Is_Author(P, SS);
END;

RELATIONSHIP Related_To
    FROM Story TO SET OF Story
END;

RELATIONSHIP Grants
    FROM Person TO SET OF Interview
WHERE
    ASSUME I : Interview
    ASSUME SI : SET OF Interview
    ASSUME P : Person
    FOREACH I:
        EXISTS P, SI:
            Grants(P, SI) AND
            I IN SI;
END;



Fig. 6. Definition of the relationships of the conceptual design 
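
The constraint on the authorship and granting relationships (every story has some author, every interview has some person granting it) can be read as a coverage check over a relation from persons to sets of targets. The Python sketch below is a hypothetical rendering of that reading; the `Relation` alias, the function `uncovered`, and the sample data are our own assumptions and not part of Flash.

```python
from typing import Dict, FrozenSet, Set

# A relation "FROM Person TO SET OF Story", keyed by person name (assumed encoding).
Relation = Dict[str, FrozenSet[str]]


def uncovered(relation: Relation, targets: Set[str]) -> Set[str]:
    """Return the targets that no source is related to.

    The FOREACH/EXISTS predicate holds exactly when this set is empty.
    """
    covered = set().union(*relation.values())
    return targets - covered


stories = {"s1", "s2", "s3"}

is_author: Relation = {
    "ann": frozenset({"s1", "s2"}),
    "bob": frozenset({"s2", "s3"}),  # s2 has two authors, which is allowed
}

all_covered = uncovered(is_author, stories)
missing = uncovered({"ann": frozenset({"s1", "s2"})}, stories)
```

A relationship without a WHERE section, such as the one between related stories, simply has no such check attached.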



Figure 7 shows an example of composites. StoryComposite declares a composite
which is very similar to Story but it also includes its author. Notice the
attribute P of the class and how the invariant binds P to the author of the
Story. In the second composite, the AuthorComposite includes a list of all the
stories written by that author. The conceptual schema only defines Persons,
not authors or interviewees; these are specified by relations. The parameter to
AuthorComposite cannot be restricted to be an author (only a Person, because
there is no Author class); nonetheless, the invariant guarantees that there are no
AuthorComposites which are composed of a Person who has not written a story.
AllInterviews shows how to create a composite without a parameters section.
In this case, it is indispensable to include the attribute and predicate sections.
Finally, InterviewWithPhotoComposite shows how inheritance can be used in
the definition of composites. In this case, the subclass restricts the composite to
include only those interviews that have at least one photo.



COMPOSITE
    StoryComposite (S : Story)
    P : Person;
WHERE
    EXISTS P : Is_Author_Of(P, S);
END;

COMPOSITE
    AuthorComposite (P : Person)
    S : SET OF Story;
WHERE
    EXISTS S : Is_Author_Of(P, S);
END;

COMPOSITE AllInterviews
    I : SET OF Interview;
WHERE
    I = DOM(Interview);
END;

COMPOSITE
    InterviewWithPhotoComposite (I : Interview)
    INHERITS FROM StoryComposite
WHERE
    I.Photo != EMPTY_SET;
END;



Fig. 7. Flash Specification of the Navigational Classes of the Magazine Web Site 
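
One way to picture a composite with an invariant is as a constructor that derives its extra attributes from a relationship and refuses to build an instance when the invariant cannot hold. The following Python sketch makes that reading concrete; the factory functions, the `Relation` encoding, and the decision to collect all authors of a story (rather than a single one) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

# Authorship relation keyed by person name (assumed encoding).
Relation = Dict[str, FrozenSet[str]]


@dataclass(frozen=True)
class StoryComposite:
    story: str
    authors: FrozenSet[str]


@dataclass(frozen=True)
class AuthorComposite:
    person: str
    stories: FrozenSet[str]


def make_story_composite(story: str, is_author_of: Relation) -> StoryComposite:
    """Bind the author attribute through the relationship; fail if none exists."""
    authors = frozenset(p for p, ss in is_author_of.items() if story in ss)
    if not authors:
        raise ValueError(f"invariant violated: {story!r} has no author")
    return StoryComposite(story, authors)


def make_author_composite(person: str, is_author_of: Relation) -> AuthorComposite:
    """The parameter is any person, but the invariant rejects non-authors."""
    stories = is_author_of.get(person, frozenset())
    if not stories:
        raise ValueError(f"invariant violated: {person!r} has written no story")
    return AuthorComposite(person, stories)


IS_AUTHOR_OF: Relation = {
    "ann": frozenset({"s1", "s2"}),
    "bob": frozenset({"s2"}),
}

sc = make_story_composite("s2", IS_AUTHOR_OF)
ac = make_author_composite("ann", IS_AUTHOR_OF)
```

Calling `make_author_composite("carl", IS_AUTHOR_OF)` raises `ValueError`, which corresponds to the observation that no author composite exists for a person who has not written a story.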



The semantics of inheritance require further explanation. A class C can
inherit from class P if C is type compatible with P. C (with parameter
type signature $C_1, \ldots, C_i$) is type compatible with P ($P_1, \ldots, P_j$) iff $i \geq j$ and
$\forall k \in 1 \ldots j$, $P_k$ is type equivalent to $C_k$ or $P_k$ is an ancestor of $C_k$. In the child
class, the resulting attributes are the union of the attributes of the child and the
parent; the resulting predicate section is the conjunction of the predicate sections
of the parent and the child. Under some circumstances, some statements in the
parent’s predicate section can be altered by the child by using the hide operator
(similar to its namesake in Z).
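
The type-compatibility rule for inheritance can be checked mechanically. The Python sketch below implements the rule as stated (the child has at least as many parameters as the parent, and each parent parameter type equals, or is an ancestor of, the child's parameter type in the same position). The `PARENT` table and the function names are hypothetical, introduced only for this example.

```python
from typing import Dict, Optional, Sequence

# Hypothetical class hierarchy: each class maps to its parent (None for roots).
PARENT: Dict[str, Optional[str]] = {
    "Story": None,
    "Interview": "Story",
    "Person": None,
}


def is_ancestor(candidate: str, cls: str, parent: Dict[str, Optional[str]]) -> bool:
    """True if `candidate` is a proper ancestor of `cls`."""
    current = parent.get(cls)
    while current is not None:
        if current == candidate:
            return True
        current = parent.get(current)
    return False


def type_compatible(child_params: Sequence[str],
                    parent_params: Sequence[str],
                    parent: Dict[str, Optional[str]]) -> bool:
    """Child signature C_1..C_i is compatible with parent signature P_1..P_j."""
    if len(child_params) < len(parent_params):  # requires i >= j
        return False
    # zip stops after the j parent parameters, so extra child parameters are ignored.
    return all(p == c or is_ancestor(p, c, parent)
               for p, c in zip(parent_params, child_params))


# A composite over Interview inheriting from a composite over Story.
ok = type_compatible(["Interview"], ["Story"], PARENT)
bad_type = type_compatible(["Person"], ["Story"], PARENT)
too_few = type_compatible([], ["Story"], PARENT)
extra_param = type_compatible(["Story", "Person"], ["Story"], PARENT)
```

Under this rule a composite parametrized by an Interview may inherit from one parametrized by a Story, because Story is an ancestor of Interview.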

OOHDM defines contexts as sets. Unfortunately, this definition requires an 
external mechanism for sorting its components. In Flash, contexts are formally
defined as sequences (lists) of navigational objects. Contexts can contain objects
of different classes; hence, they are not restricted to contain elements of a
given class. Flash provides a toolkit of operators specifically to create and alter
contexts (similar to those of Scheme).

Figure 8 shows the specification of the contexts. The StoriesContext is declared
using a context constructor. The constructor creates a sequence of StoryComposites
(whose type is declared in the OF part of the constructor). Each
StoryComposite corresponds to one Story in the domain of the application (specified
in the WHERE section). The WHERE section specifies how to create a set
of composites, which have to be ordered. The ordering is specified by an ordering
function (which is declared, but not defined in the application) and a tuple
(which specifies which attributes of the object are used as parameters to the
sorting function). In this example, the attribute to use is the date; therefore, the
context is a list of story composites ordered by date. The absence of an ORDER
section indicates that the order is unimportant and is implementation dependent.
The TopStoriesContext selects the first 10 elements of the StoriesContext.



ORDERFUNCTIONS = [sort_date (Story)]

CONTEXT StoriesContext =
    CREATECONTEXT OF S : StoryComposite
    WHERE
        FORALL St : Story IN dom(Story)
            S = StoryComposite(St);
    ORDER
        S.Date BY sort_date
END;

CONTEXT TopStoriesContext =
    (HEAD 10 StoriesContext) END;

HighlightsList : VAR
    SEQUENCE OF Story;

CONTEXT HighlightsContext =
    (APPLY(HighlightsList,
           StoryComposite)) END;

CONTEXT StoriesByAuthor =
    CREATECONTEXT OF S : StoryComposite
    WHERE
        FORALL St : Story IN dom(Story)
            S = StoryComposite(St);
    ORDER
        S.Author
END;



Fig. 8. Specifications of Contexts 



HighlightsContext is an example of the creation of a context based on a given
set. In this case HighlightsList is a given sequence of stories. HighlightsContext
is created by instantiating StoryComposite with the contents of the sequence.
The first parameter to APPLY is already a sequence and its result is a sequence.
StoriesByAuthor specifies that there is a context of StoryComposites which is
ordered by author. The attribute Author (of the object S) is a string, which is a
simple type, and has a default ordering function (lexicographical order, in this
case).
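
The context operators described here behave much like ordinary list operations. The sketch below mimics them in Python under our own assumptions: `create_context`, `head`, and `apply_constructor` are hypothetical stand-ins for the context constructor, HEAD, and APPLY, and the sample stories are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Story:
    title: str
    published: date
    author: str


@dataclass(frozen=True)
class StoryComposite:
    story: Story


def create_context(domain: Sequence[Story],
                   order_key: Optional[Callable[[StoryComposite], object]] = None
                   ) -> List[StoryComposite]:
    """Build one composite per story; sort only when an ordering is given."""
    composites = [StoryComposite(s) for s in domain]
    if order_key is not None:
        composites.sort(key=order_key)
    return composites


def head(n: int, context: Sequence[StoryComposite]) -> List[StoryComposite]:
    """First n elements of a context."""
    return list(context[:n])


def apply_constructor(items: Sequence[Story],
                      constructor: Callable[[Story], StoryComposite]
                      ) -> List[StoryComposite]:
    """Map a composite constructor over a given sequence, preserving its order."""
    return [constructor(i) for i in items]


stories = [
    Story("Budget vote", date(1999, 11, 3), "Mary"),
    Story("Cup final", date(1999, 11, 1), "John"),
    Story("Storm warning", date(1999, 11, 2), "Ann"),
]

stories_context = create_context(stories, order_key=lambda c: c.story.published)
top_stories = head(2, stories_context)
highlights = apply_constructor([stories[2], stories[0]], StoryComposite)
by_author = create_context(stories, order_key=lambda c: c.story.author)
```

Because a context is a sequence rather than a set, taking its first elements and preserving the order of a hand-picked list are both well defined without any external sorting mechanism.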




Formalizing the Specification of Web Applications 289 



4 From Specification to Implementation 

Up to this point, we have presented the specification of content, composites and 
contexts. We have not specified what constitutes a page and what anchors are 
linked to which pages. This is deliberate. By isolating presentation from the
conceptual and navigational schema, it is possible to implement the same design
in many different ways. For instance, in the Magazine application, all the stories
could be presented in a single page, or each story could be presented in a different
page, with links that allow the reader to navigate the context sequentially
(previous, next) or through a “table of contents” of the context. This separation
is important when the application has different “versions” intended for different
audiences, or when it has to run in different environments, each with different
capabilities (for example, the same information has to be presented on paper, on
the Web and on CD-ROM).
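
To make the separation concrete, the following Python sketch renders one and the same context in two ways: as a single page, and as one page per element with previous/next links. The page representation (a dictionary with `body`, `prev`, and `next`) and the function names are assumptions made for this illustration; they are not part of Flash or OOHDM.

```python
from typing import Dict, List


def render_single_page(context: List[str]) -> List[Dict[str, object]]:
    """All elements of the context on one page; no sequential links are needed."""
    return [{"body": list(context), "prev": None, "next": None}]


def render_one_page_per_item(context: List[str]) -> List[Dict[str, object]]:
    """One page per element, linked sequentially by page index."""
    last = len(context) - 1
    pages = []
    for i, item in enumerate(context):
        pages.append({
            "body": [item],
            "prev": i - 1 if i > 0 else None,
            "next": i + 1 if i < last else None,
        })
    return pages


context = ["Cup final", "Storm warning", "Budget vote"]

single = render_single_page(context)
paged = render_one_page_per_item(context)
```

Both renderings start from the same context, so a change in presentation does not require any change to the conceptual or navigational specification.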

The conceptual and navigational design are further refined by the interface 
design. During the interface design, it is decided which composite classes are 
converted to nodes (or pages). It is also decided which attributes of the classes 
are presented (for instance, it might be relevant to show only the summary of the 
stories within a given context, but the whole text under another one). A given 
navigational object can be included in two different contexts. As a result, the 
designer must decide whether to create two instances of the same object (one for 
each context) or to create only one which is referenced by both contexts. The 
former approach allows the implementor to present the two different instances 
in different ways, while the latter can only be presented in one. We are currently 
researching the requirements for a specification language for the interface design 
that addresses these issues. 

The final stage of OOHDM is the implementation. It is at this point that the 
content is actually converted into pages, and it is decided what kind of typeset- 
ting should be used to present the different attributes of the navigational objects. 
The implementor should decide, depending on the characteristics of the
run-time system, whether to create a static application (as a collection of HTML
pages) or CGI scripts that dynamically create pages. Flash is not concerned
with the specifics of this stage. Nonetheless, the formal nature of Flash and the
detail with which it can describe the design of an application open the possibility
of automatic instantiation of the implementation. For example, WCML [3]
is a language used to implement Web applications based on OOHDM; by adding 
the information needed to turn the specification into an implementation, it will 
be possible to translate Flash specifications into WCML programs. 

5 Verifying properties of the Specification 

Flash gives the designers immediate advantages. First, it has a formal syntax 
and semantics, making the specification unambiguous. Second, because Flash is 
based on higher-order logic, we can verify the type consistency of the specification. 
Third, the designer knows whether the specification is complete (all identifiers 
should be defined before being used) even though it is possible to leave “parts” 
unspecified by using “given” types. These advantages are impossible to attain 
using diagrammatic or informal textual descriptions. 
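
The completeness property can be illustrated with a minimal sketch. The code below is not Flash syntax: the `Schema` class and the `undefined_references` function are hypothetical stand-ins, and the check ignores declaration order. It only shows the idea that every referenced type must be either a schema or a "given" type.

```python
from dataclasses import dataclass, field


# Hypothetical, simplified stand-in for a specification (NOT Flash syntax).
@dataclass
class Schema:
    name: str
    # attribute name -> name of the type the attribute refers to
    attributes: dict[str, str] = field(default_factory=dict)


def undefined_references(schemas: list[Schema], given_types: set[str]) -> set[str]:
    """Return every referenced type that is neither a schema nor a given type."""
    defined = {s.name for s in schemas} | given_types
    referenced = {t for s in schemas for t in s.attributes.values()}
    return referenced - defined


# Invented sample specification, for illustration only.
spec = [
    Schema("Story", {"title": "Text", "author": "Author"}),
    Schema("Author", {"name": "Text", "photo": "Image"}),
]

print(undefined_references(spec, given_types={"Text"}))           # {'Image'}
print(undefined_references(spec, given_types={"Text", "Image"}))  # set()
```

Declaring `Image` as a given type leaves that part unspecified while still making the specification complete in the sense described above.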

We are currently working on the automatic translation of Flash specifications 
into PVS. PVS, the Prototype Verification System, is an interactive environment 
for the verification of the specification of software systems. The user verifies a 
specification in PVS by proving theorems about it. These theorems are written 
by the user. 

Figure 9 shows the PVS equivalent of the conceptual schema of the magazine 
site. 

By converting the specification to PVS, the designer can verify properties 
that are not verifiable in Flash. Examples are the questions “is the number of 
InterviewWithPhotoComposite objects always equal to the number of Interviews 
objects?” and “can there exist a StoryNode without an author?”. Although the 
answers to these questions are obvious, there are no mechanisms in OOHDM or in 
Flash to answer them. 



6 Conclusions 

Flash is a formal specification language for hypermedia design which is based on 
the Object-oriented Hypermedia Design Methodology. The main objectives of 
Flash are: 

— to specify unambiguously the data needed for an application and how it is 
collated into complex composites; 

— to provide a mechanism for type verification of the specification (Flash is 
strongly typed); 

— to allow verification of completeness: any referenced item or type must be 
specified either explicitly as a schema or as a given type; 

— to be powerful enough to express the design of discrete hypermedia applica- 
tions. 

By formally specifying the design of a hypermedia application, we can reason 
about it without actually implementing it. 

Abstract. “A pattern ... describes a problem which occurs over and 
over again in our environment, and then describes the core of the so- 
lution to that problem, in such a way that you can use this solution a 
million times over” [1]. The possible benefits of using design patterns 
for Web applications are clear. They help fill the gap between re- 
quirements specification and conceptual modeling. They support 
conceptual modeling-by-reuse, i.e. design by adapting and combin- 
ing already-proven solutions to new problems. They support conceptual 
modeling-in-the-very-large, i.e. the specification of the general fea- 
tures of an application, ignoring the details. This paper describes relevant 
issues about design patterns for the Web and illustrates an initiative of 
ACM SIGWEB (the ACM Special Interest Group on Hypertext, Hypermedia,
and the Web). The initiative aims, with the contribution of 
researchers and professionals of different communities, to build an on-line 
repository for Web design patterns. 



1 Introduction 

There is a growing interest in design patterns, following the original definition 
by architect Christopher Alexander [1]. In this paper we focus particularly on 
the notion of “pattern” as applied to the process of designing a Web application 
at the conceptual level. 

Patterns are already very popular in the software engineering community, 
where they are used to record the design experience of expert programmers, 
making that experience reusable by other, less experienced, designers [2, 6, 8, 9, 
19, 21, 27]. Only recently have investigations studied the applicability and utility 
of design patterns to the field of hypermedia in general, and to the Web in 
particular [3, 7, 17, 22, 25, 26]. 

It can be interesting to analyze how design patterns modify the traditional 
approach to conceptual modeling. Conceptual modeling allows describing the 
general features of an application, abstracting from implementation details [5]. 
Conceptual modeling methodologies in general, however, offer little guidance 
on how to match modeling solutions to problems. Design patterns, instead, ex- 
plicitly match problems with solutions, therefore (at least in principle) they 
are easier to use and/or more effective. Design patterns, in addition, support 
conceptual modeling by-reuse, i.e. the reuse of design solutions for several similar
problems [13, 22, 25]. Finally, design patterns support conceptual modeling 
in-the-very-large. They allow designers to specify very general properties of an 
application schema, and, therefore, they allow a more “abstract” design with 
respect to traditional conceptual modeling. 

Despite the potential and popularity of design patterns, only a few effectively 
usable design patterns have been proposed so far for the Web. This lack of 
available design patterns is a serious drawback, since design patterns are mainly 
useful if they are widely shared and circulated across several communities. Just 
to make an obvious example, navigation design patterns (originating within the 
hypertext community), cannot be completely independent from interface design 
patterns (originating within the visual design community). Both types of design 
patterns are more effective if they are shared across the respective originating 
communities. 

The main benefits we expect from widespread adoption of design patterns 
for the Web include the following. 

— Quality of design. Assume that a large collection of “good” and tested design
patterns is available. We expect that, by reusing patterns, an inexperienced
designer would produce a better design than by modeling from 
scratch. 

— Time and cost of design. For similar reasons we expect that an inexperienced 
designer, using a library of patterns, would be able to complete her design 
with less time and less cost than would otherwise be required. 

— Time and cost of implementation. We expect that implementation strate- 
gies for well-known design patterns are widely available. We may therefore 
expect that the implementation of an application designed by composing de- 
sign patterns will be easier than implementing an application designed from 
scratch. Additionally, in the future we may expect that toolboxes for imple- 
mentation environments, such as JWeb [4], Autoweb [24], and Araneus [20], 
will provide automatic support for a number of different design patterns. 

The rest of this paper is organized as follows. In Section 2 we describe a way 
to define design patterns and discuss conceptual issues involved with the defi- 
nition of a pattern. In Section 3 we introduce examples of design patterns. In 
Section 4 we state our conclusions and describe the ACM-SIGWEB initiative 
for the development of a design-pattern repository. 

2 Defining Design Patterns 

The definition schema for a pattern includes the following: 

— The pattern name, used to unambiguously identify the pattern itself. 

— The description of the design problem addressed by the pattern. It describes 
the context or scenario in which the pattern can be applied, and the relevant 
issues. 

— The description of the solution proposed by the pattern, possibly with a 
number of variants. Each variant may describe a slightly different solution, 
possibly suitable for a variant of the problem addressed by the pattern. The 
use of variants helps keep down the number of distinct patterns. 

— A list of related patterns (optional). Relationships among patterns can be 
of different natures. Sometimes a problem (and the corresponding solution) 
addressed by a pattern can be regarded as a specialization, generalization, 
or combination of problems and solutions addressed by other patterns. One 
pattern might address a problem addressed by another pattern from a dif- 
ferent perspective (e.g. the same problem investigated as a navigation or an 
interface issue). Several other relationships are also possible. 

— The discussion of any aspect which can be useful for better understanding the 
pattern, also including possible restrictions of applicability and conditions 
for effective use. 

— Examples that show actual uses of the pattern. Some designers claim that 
for each “real” pattern there should be at least three examples of usage by 
someone who is not the proposer of the pattern itself. The idea behind this 
apparently strange requirement is that a pattern is such only if it is proved 
to be usable (i.e. is actually used) by several people. 

In the remainder of this paper we use this simple schema to describe design 
patterns. 
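
As an illustration only, the definition schema can be written down as a small record type. The sketch below is an assumption about how a pattern entry might be stored: the class and field names (`DesignPattern`, `Usage`, `proposer`, `designer`) are hypothetical and do not come from HDM. The `is_proven` method encodes the "at least three independent uses" rule of thumb mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Usage:
    site: str      # where the pattern was observed
    designer: str  # who applied it (hypothetical field)


@dataclass
class DesignPattern:
    name: str                  # unambiguous identifier
    problem: str               # context, scenario, relevant issues
    solution: str              # core solution
    proposer: str = ""         # hypothetical field, used only for the usage check
    variants: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)   # optional
    discussion: str = ""
    examples: list[Usage] = field(default_factory=list)

    def independent_designers(self) -> set[str]:
        """Designers other than the proposer who have used the pattern."""
        return {u.designer for u in self.examples if u.designer != self.proposer}

    def is_proven(self, minimum: int = 3) -> bool:
        """Apply the 'at least three independent uses' rule of thumb."""
        return len(self.independent_designers()) >= minimum


# Invented sample data, for illustration only.
index_navigation = DesignPattern(
    name="Index Navigation",
    problem="Fast access to a group of objects for users able to choose.",
    solution="Link the collection entry point to each member and back.",
    proposer="designer-0",
    related=["Collection Center"],
    examples=[Usage("site-a", "designer-1"), Usage("site-b", "designer-0")],
)
print(index_navigation.is_proven())  # False: only one independent designer so far
```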

A design model has a different aim than a design pattern. The purpose of 
a conceptual model is to provide a vocabulary of terms and concepts that can 
be used to describe problems and/or solutions of design. It is not the purpose 
of a model to address specific problems, and even less to propose solutions for 
them. Drawing an analogy with linguistics, a conceptual model is analogous to 
a language, while design patterns are analogous to rhetorical figures, which are 
predefined templates of language usages, suited particularly to specific problems. 

Therefore, design models provide the vocabulary to describe design patterns, 
but should not influence (much less determine) the very essence of the patterns. 
The concept behind a design pattern should be independent from any model, ^ 
and, in principle, we should be able to describe any pattern using the primitives 
of any design model. 

In this paper we use the Web design model HDM, the Hypermedia Design 
Model [10-12], but other models (such as OOHDM [28] or RMM [18]) could also 
be used. We now discuss some general issues concerning Web design patterns. 

2.1 Design Scope 

Design activity for any application, whether involving patterns or not, may con- 
cern different aspects [5]. The first aspect, relevant to Web applications, is related 
to structure. Structural design can be defined as the activity of specifying types 
of information items and how they are organized. 

^ Continuing our analogy, a rhetorical figure is independent from the language used 
to implement it. 

The second aspect is related to navigation across the different structures. 
Navigation design can be defined as the activity of specifying connection paths 
across information items and dynamic rules governing these paths. 

The third relevant aspect is presentation/interaction. Presentation/interac- 
tion design is the activity of organizing multimedia communication (visual + 
audio) and the interaction of the user. We could also distinguish between con- 
ceptual presentation/interaction, dealing with the general aspects (e.g. clustering 
of information items, hierarchy of items), and concrete presentation/interaction, 
dealing with specific aspects (e.g. graphics, colors, shapes, icons, fonts). 

Ideally, each design pattern should be concerned with only one of the above 
aspects. We must admit, however, that the usefulness (or the strength, or 
the beauty, ...) of a pattern sometimes stems from considering several aspects 
simultaneously and from their interplay. 



2.2 Level of Abstraction 

A design problem can be dealt with at different levels of abstraction. For a 
museum information point, for example, we could say that the problem is “to 
highlight the current main exhibitions for the casual visitor of the site.” Alter- 
natively, at a different level of abstraction, we could say that the problem is “to 
build an effective way to describe a small set of paintings in an engaging manner 
for the general public.” Yet at another level we could say that the problem is 
to “create an audio-visual tour with automatic transition from one page to the 
next.” The third way of specifying the problem could be considered a refinement 
of the second, which in turn could be considered a refinement of the first. 

Design patterns can be useful or needed at different levels of abstraction, 
corresponding to the different types of designers. Interrelationships among 
patterns at different levels should also be identified and described. 



2.3 Development Method 

Some authors claim that patterns are discovered, not invented [25, 28], implicitly 
suggesting that not only the validation but also the identification of a pattern 
is largely founded on the frequency of use of a given design. 

This approach is only partially correct. First, the pure analysis of existing 
solutions does not detect their relationships to application problems (which are 
usually unknown and must be arbitrarily guessed). In addition, a pure frequency 
analysis could lead to misleading results. In a frequency analysis of the navigation 
structure of 42 museum Web sites (selected from the Best of the Web nomination 
list of “Museums and the Web” in 1998 [7]), we detected a number of “bad” 
navigation patterns. The term “bad” in this case is related to the violation of 
some elementary usability criteria [14-16,23]. 

We are therefore convinced that frequency of use is useful for a posteriori 
validation, but it cannot deliver, for the time being, a reliable development 
method. Until the field matures, design patterns for the Web should come from 
a conscious activity of expert designers trying to condense their experience, and 
the experience of other designers, into viable problem/solution pairs. 

3 Examples of Design Patterns for the Web 

In this section we give definitions of several design pattern examples. Since lack of 
space prevents us from explaining HDM in depth, we do not use precise modeling 
primitives to describe the patterns; the negative effect upon the precision of the 
definition is balanced by the enhanced readability. 

The goal is to convince the reader that a design pattern can condense a 
number of useful hints and notions, providing a contribution to the quality and 
efficiency of design. We remind the reader that design patterns are devices for 
improving the design process, aimed mainly at inexperienced designers; they are 
not a way for experienced designers to discover new, very advanced features. 

The examples we discuss in this section are related to collections [11,12] in 
the terminology of HDM. A collection is a set of strongly correlated items called 
members. Collection members can be pages representing application-domain ob- 
jects, or can themselves be other collections (nested collections). A page acts as 
a collection entry point: information items associated with the entry point are 
organized in a collection center, according to HDM (and this will be elaborated 
later). Collections may correspond to “obvious” ways of organizing the members 
(e.g. “all the paintings,” “all the sculptures”), they may satisfy domain-oriented 
criteria (e.g. “works by technique”), or they may correspond to subjective crite- 
ria (e.g. “the masterpieces” or “the preferred ones”). The design of a collection 
involves a number of different decisions, involving structure, navigation, and 
presentation (see references). 

Pattern Name. Index Navigation 

Problem. To provide fast access to a group of objects for users who are inter- 
ested in one or more of them and are able to make a choice.^ 

Solution. The core solution consists of defining links from the entry point of 
the collection (the collection center in HDM) to each member, and from each 
member to the entry point. To speed up navigation, a variant to this core so- 
lution consists of including in each member links to all other members of the 
collection, so that the user can jump directly from one member to another with- 
out returning to the collection entry point. 
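
The link structure of the core solution and of its variant can be sketched as follows. This is a minimal illustration, not an HDM primitive: the function name `index_navigation_links` and the representation of a link as a `(source, destination)` pair are assumptions made for the example.

```python
def index_navigation_links(center, members, direct_member_links=False):
    """Build the set of (source, destination) links for Index Navigation.

    Core solution: center -> each member, and each member -> center.
    Variant (direct_member_links=True): each member also links to every
    other member, so the user need not return to the center.
    """
    links = set()
    for member in members:
        links.add((center, member))
        links.add((member, center))
    if direct_member_links:
        links |= {(a, b) for a in members for b in members if a != b}
    return links


# Invented page names, for illustration only.
paintings = ["painting-1", "painting-2", "painting-3"]
core = index_navigation_links("center", paintings)
variant = index_navigation_links("center", paintings, direct_member_links=True)
print(len(core), len(variant))  # 6 12
```

For n members the core solution needs 2n links, while the variant needs 2n + n(n-1). Each member page of the variant carries n-1 extra anchors, which is where the layout cost discussed for this pattern comes from.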

Related Patterns. Collection Center, Hybrid Navigation 

Discussion. The main scope of this pattern is navigation. In its basic formulation,
it captures one of the simplest and most frequent design solutions for navigating
within collections, adopted in almost all Web applications. The variant 
is less frequent, although very effective to speed up the navigation process. The 
main advantage of the variant is that a number of navigation steps are skipped if 
several members are needed. The disadvantage of the variant is mainly related to 
presentation:^ displaying links to other members occupies precious layout space. 

^ This requirement is less obvious than it may appear at first sight; in many situations, 
in fact, the reader may not be able to make a choice out of a list of items. She may 
not know the application subject well enough; she may not be able to identify the 
items in the list; she may not have a specific goal in mind, etc. 

Examples. Navigation “by-index” is the main paradigm of the Web site of the 
Louvre museum (www.louvre.fr; May 18, 1999), where most of the pages rep- 
resenting art objects are organized in nested collections, according to different 
criteria. Figure 1 shows the entry point of the top-level collection, where users 
can navigate “by-index” to each specific sub-collection. Figure 2 shows the entry 
point (center) of the sub-collection “New Kingdom of Egyptian Antiquities,” 
where each artwork from Egypt dated between 1550 and 1069 BC can be ac- 
cessed directly. From each artwork, the only way to access another page within 
the same collection is to return to the center. 



Fig. 1. The center of a collection of collections, in the Louvre Web site (www.louvre.fr) 



An example of the variant is found in the Web site of the National Gallery 
of London (www.nationalgallery.org.uk; May 18, 1999). Each painting in a 
collection can be accessed directly from the entry point page that introduces the 
collection (e.g. “Sainsbury Wing” in Figure 3) and from each member (e.g. “The 
Doge Leonardo Loredan” in Figure 4). Each page related to a painting has links 
to the other members, represented by thumbnail pictures, title, and author. 

^ Even this simple example shows that it is difficult to discuss one aspect, navigation, 
without discussing also other aspects, e.g. presentation. 

Fig. 2. The center of the collection “Egyptian artworks dated between 1550 and 1069 
BC,” in the Louvre Web site (www.louvre.fr) 

Fig. 3. The center of the collection “Sainsbury Wing” in the National Gallery of London 

Fig. 4. A member of the collection “Sainsbury Wing” in the National Gallery of London 
Web site (www.nationalgallery.org.uk) showing links to all other members of the 
collection 

Pattern Name. Guided Tour Navigation 

Problem. To provide “easy-to-use” access to a small group of objects, assuming 
the user has no reason (or is unable) to select one of them. 

Solution. The solution consists of identifying an order among the collection 
members, and creating sequential links among them. Links can be one-way or 
two-way (forward or backward). A variant is the circular guided tour, where the 
last member is linked to the first. 

To start the navigation, the page corresponding to the collection center must 
link to the first member. In order to allow return to the collection center several 
variants are available: establishing a link from every member, the last member, 
or the first member to the collection center. The circular variant is useful in 
the last case, in order to avoid the need for the user to scan the whole collec- 
tion backwards up to the first member. In order to improve usability [14, 15] it 
is advisable to support user’s orientation, i.e. to include in each member some 
perceivable visual cues about the current navigation status. Examples of such 
cues are an indication of the name of the current collection, the position of the 
current member, and the total number of members. 
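
The variants described above can be sketched as parameters of a single link-building function. The names `guided_tour_links`, `return_from`, and `orientation_cue` are hypothetical, and the reading of the circular variant (only the link from the last member to the first is added) is an assumption of this sketch.

```python
def guided_tour_links(center, members, two_way=True, circular=False,
                      return_from="every"):
    """Build (source, destination) links for a guided tour.

    return_from selects which members link back to the center:
    "every", "last", or "first".
    """
    if not members:
        return set()
    links = {(center, members[0])}            # the tour starts from the center
    for current, following in zip(members, members[1:]):
        links.add((current, following))       # forward link
        if two_way:
            links.add((following, current))   # backward link
    if circular and len(members) > 1:
        links.add((members[-1], members[0]))  # last member wraps to the first
    if return_from == "every":
        sources = members
    elif return_from == "last":
        sources = members[-1:]
    elif return_from == "first":
        sources = members[:1]
    else:
        raise ValueError("return_from must be 'every', 'last' or 'first'")
    links |= {(member, center) for member in sources}
    return links


def orientation_cue(collection_name, members, current):
    """Text cue: collection name, position of current member, total count."""
    return f"{collection_name}: {members.index(current) + 1} of {len(members)}"


# Invented page names, for illustration only.
views = ["view-1", "view-2", "view-3"]
tour = guided_tour_links("showroom", views, two_way=False, circular=True,
                         return_from="first")
print(sorted(tour))
print(orientation_cue("Virtual Showroom", views, "view-2"))  # Virtual Showroom: 2 of 3
```

With `return_from="first"` and a one-way tour, the circular link is what lets the user reach the center again without scanning the collection backwards.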

Related Patterns. Collection Center, Hybrid Navigation 

Discussion. This is a navigational pattern. It is suitable for “naive” users (with 
little knowledge of the application domain) or for first-time users (who need 
to get acquainted with the content of the application) or “couch-potato” users 
(who want to get an easy ride around, rather than engaging in free navigation). 
The main disadvantage of a guided tour is that its members must be traversed 
sequentially: it may be bothersome if the user wants to rapidly access a member 
of his or her choice, or if there are many members. 

Examples. Guided tours are quite common, and we have found interesting 
examples in commercial sites of car companies. Though complex and highly 
structured in most cases, these applications are mainly intended to provide 
the user an easy, flamboyant presentation of new car models. For exam- 
ple, see Opel (www.opel.com), Volkswagen (www.volkswagen.com), Renault 
(www.renault.com), BMW (www.bmw.com), Porsche (www.porshe.com), Citroen 
(www.citroen.com), Mercedes (www.mercedes.com), Audi (www.audi.com), 
Ferrari (www.ferrari.com), and Lamborghini (www.lamborghini.com); 
all inspected May 18, 1999. 

Most of these sites have a section called “Virtual Showroom” (or a similar 
name) that is organized as a guided tour. From a page promoting the style and 
the character of a car model, the user can access different views of a car (either 
pictures showing the interior or the exterior of the car or photos of the car on the 
road). The navigation structure of a virtual showroom is illustrated in Figure 5, 
and corresponds to a circular guided tour, as described above. 



Fig. 5. The navigation structure of a Virtual Show Room (diagram nodes: Car, Car View 1, Car View 2) 



Pattern Name. Hybrid Collection 

Problem. To provide easy-to-use access to a small group of objects, allowing 
both a complete scan and searching for a specific member. 

Solution. Combine solutions of Index Navigation and Guided Tour Navigation. 

Related Patterns. Index Navigation, Guided Tour Navigation, and Collection 
Center 

Discussion. This pattern provides a compromise to satisfy the requirements of 
different types of users, or of different situations of use, within the same design. 
Users can choose different navigation styles according to their (evolving) needs. 
First-time users, for example, may prefer to traverse the collection sequentially 
in the guided-tour style. For subsequent visits they may adopt the index style, 
looking for specific members already seen. 

Examples. This hybrid pattern is adopted, among other sites, at the National 
Gallery of Washington (www.nga.gov; May 18, 1999). This site presents a vast 
selection of paintings exhibited in the museum and is one of the best organized 
museum applications on the Web. Access to artwork is available through searching
or via tours. The typical navigation structure of a tour is depicted in Figure 6. 
Each collection center provides a short introduction to its tour and shows a list 
of all the paintings included in the tour. Navigation can therefore proceed by 
selecting any of the listed works. From each work the user can either proceed to 
navigate to the center (index page) or directly choose another work. “Next” for 
the last member of the collection is linked to the center (index) page. 




Fig. 6. The navigation structure of a tour in the National Gallery of Washington 



Pattern Name. Collection Center 

Problem. To make a collection well understood and usable. 

Related Patterns. Index Navigation, Guided Tour Navigation, and Hybrid 
Navigation 

Solution. A collection center can provide several types of additional information 
to improve the usability and effectiveness of an application. For simple collec- 
tions it may be sufficient to provide the collection title and expressive anchors 
(place-holders) for its members. Link anchors have the dual role of describing 
the collection content and also providing a navigation mechanism. For complex 
collections it may be useful instead to design an additional page specifically de- 
voted to explaining the purpose, rationale, and background of the collection. In 
general a collection center may include the following elements: 

— An overview of the collection theme and purpose. 

— An explicit description of its scenarios of use. 

— Information common to all members (e.g. if the collection is about the works 
of a given artist, the center may include a general introduction to the artist’s 
work). 

— Anchors (placeholders) for links that allow access to collection members. 
Anchors should identify their destination by means of appropriate icons, 
textual labels, or combinations of both. 

— Anchors may also convey expressive representations of destination objects. 
Such representations could provide a preview of collection members by means 
of thumbnail-sized pictures or short descriptions. 

— The arrangement of anchors within the center should also convey the collection
topology, i.e. the arrangement of its members. 

Discussion. This pattern addresses both structure and presentation aspects. 
From a structural perspective, the key issue is how to select, for each member 
of the collection, information elements that best describe it. The choice may 
depend upon several factors: the nature of the object itself, the context provided 
by the collection, the user profile, etc. 

From a presentation perspective it must be possible to arrange the informa- 
tion elements of the center in an aesthetically pleasing presentation and in a 
usable manner. Let us consider the case where collection members are paintings. 
The information items associated with anchors of the collection center could be 
chosen from several possibilities, such as painting title, thumbnail photograph, 
artist name, technique, date and place of creation, or name of the museum where 
it is exhibited. Each of these information items could be very relevant or obvious, 
according to the context, user profile, or other factors. If the collection is “paint- 
ings of X,” the data about the painter are clearly irrelevant; if, however, the 
collection is “technique Y,” references to the technique are irrelevant. Miniatur- 
ized pictures might be relevant for moderately experienced users but irrelevant 
for art experts. 
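
The selection rule in this example can be sketched as a small filter. The field names in `ANCHOR_FIELDS` and the function `relevant_anchor_fields` are hypothetical; the sketch simply drops any item that is already fixed by the collection criterion, and treats the thumbnail as dispensable for expert users.

```python
# Candidate information items for an anchor (names are assumptions).
ANCHOR_FIELDS = ["title", "thumbnail", "artist", "technique", "date", "place", "museum"]


def relevant_anchor_fields(fixed_by_collection, expert_user=False):
    """Return the anchor fields worth showing in a collection center.

    fixed_by_collection: fields whose value is the same for every member,
    e.g. {"artist"} for a collection "paintings of X".
    """
    skip = set(fixed_by_collection)
    if expert_user:
        skip.add("thumbnail")  # miniaturized pictures add little for experts
    return [name for name in ANCHOR_FIELDS if name not in skip]


print(relevant_anchor_fields({"artist"}))
print(relevant_anchor_fields({"technique"}, expert_user=True))
```

A real design would also weigh the user profile and the nature of the objects, as noted above; the filter covers only the two cases named in the text.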

Examples. The Louvre (www.louvre.fr) and the National Gallery of London 
(www.nationalgallery.org.uk) show examples of the design solutions illustrated 
in the pattern. In both sites, the entry point to a collection of artwork is rep- 
resented in a separate page, where each member is described by a thumbnail 
picture, the artwork title, and period (in the Louvre) or the painter name (in 
the National Gallery). Examples of collection centers in these sites were shown 
in Figures 1, 2, and 3. 

While the collection center in the Louvre includes the collection title and the 
list of members, in the National Gallery site the center also provides a painting 
image (alluding to the theme of the collection) and a short textual introduction 
to the collection subject (see Figure 3). The introduction describes the selection 
criterion for the collected paintings, their main subjects, and the most important 
painters represented. 



4 Conclusions 

The design of complex Web applications can greatly benefit from the adoption 
of design patterns. These benefits include: 

— Improved quality of the result. 

— Reduction of the effort, time, and cost to develop. 

— Improved reliability. 

We wish to discuss again the interrelationship between conceptual design, as 
traditionally intended, and usage of design patterns. 




304 F. Garzotto et al. 



Conceptual models retain their usefulness, but with a radical change of role: 
instead of providing conceptual primitives that a designer uses to “think of” 
an application, they provide the basic lexicon and syntax that can be used to 
define design patterns. For this reason the discussion about merits and demerits 
of different modeling primitives becomes less relevant. The concept expressed by 
a design pattern is independent from the model used to describe it. 

The greatest impact is on the design process. The designer is induced to think 
in terms of application requirements (problems) and solutions (patterns), rather 
than in terms of pure modeling. Most of the benefits of such a change have al- 
ready been illustrated in this paper. We would like to mention here an additional 
benefit: the brevity of the documentation required. Describing a complex guided 
tour from scratch may require up to 2-3 pages of documentation; describing it 
by using HDM may require half a page of documentation; describing it by referring
to a well-known design pattern requires only a few parameters, i.e. a few 
lines of documentation. The gains are precision, compactness, and completeness 
of the documentation. 

A more radical point of view could be the following: with design patterns we 
are introducing new, high-level design primitives, therefore using design patterns 
is the same as defining a new higher level design model. In a sense this argument 
is true, since the design pattern at one level is clearly a model primitive at a 
higher level. ^ Nevertheless the distinction remains practically relevant: we ex- 
pect from a design model a few consistent, well-organized primitives. We cannot 
expect the same from a large library of design patterns, where many people have 
contributed; we may expect there to be redundancy, inconsistency, lack of organization,
etc. From a theoretical point of view (to be further investigated) we 
may say that we need a field of design patterns until that field becomes mature 
enough that a new, higher level model can incorporate the essence of the design 
primitives required. 

The most relevant problem for a Web designer today, assuming s/he is con- 
vinced of the advantages of using design patterns, is to have a suitable supply 
of them at hand. Designers may generate patterns from their own experience, 
but probably there will not be many such patterns, especially if the designer 
is not very experienced. The second possibility is to get patterns from friends, 
but this may be time consuming and ineffective. Such patterns may be poorly 
organized, too specific, or too general to be useful; they may deal with the wrong 
area of interest (e.g. they emphasize presentation issues when the designer needs 
a navigational pattern); they may be defined at the wrong level of abstraction. 

Having recognized these problems, ACM SIGWEB (the ACM Special Inter- 
est Group on Hypertext, Hypermedia, and the Web) has launched an initiative 
to create an “On-Line WWW Design Pattern Repository” [17]. The repository 
is coordinated by a small steering committee, and operationally managed by 
Politecnico di Milano (for design and development) and USI - Università della
Svizzera Italiana in Lugano, Switzerland (for designing and managing operations).

^ We could even say, for example, that the “for” operation in C++ is a pattern for
combining assembly language instructions, or that a “many-to-many relationship”
in the ER Model is nothing more than a pattern for the Relational Model.

The basic idea is that the repository will be open to all contributions from 
all different communities, and it will cover all the possible areas of interest and 
all the interesting levels of abstraction. The repository will not enforce any pre- 
defined view of what a design pattern should, or should not, be. It will enforce a 
bare minimum of standardization of the description format and will require evi- 
dence of the usage of each submitted pattern: i.e. at least three designers, other 
than the person submitting the pattern, should have used it.® For practical reasons,
tentative patterns will also be accepted, i.e. patterns whose definition is
supported by just one application. A system of keywords and classifications (by 
scope, application area, level of abstraction, etc.) will help designers to quickly 
locate useful patterns. 

Before Summer 1999 the repository will be open for submissions;® before the 
end of 1999 the repository will be open for access to designers. 

References 

1. C. Alexander, S. Ishikawa, M. Silverstein, M. Jacobson, I. Fiksdahl-King and S.
Angel. A Pattern Language. Oxford University Press, New York, 1977. 

2. B. Appleton. Patterns and Software: Essential Concepts and Terminology. Available
at www.enteract.com/~bradapp/docs/patterns-intro.html, 1997.

3. Bernstein M. “Patterns of Hypertext”, In Proc. of the ACM International Confer- 
ence on Hypertext ’98, ACM Press, 1998, pp. 21-29. 

4. M. A. Bochicchio, P. Paolini, “An HDM Interpreter for On-Line Tutorials,” In Proceedings
of MultiMedia Modeling 1998 (MMM’98), pp. 184-190, Ed. N. Magnenat-Thalmann
and D. Thalmann, IEEE Computer Society, Los Alamitos, California,
USA, 1998.

5. Brodie M.L., “On the Development of Data Models”. In Brodie M.L., Mylopoulos 
J., and Schmidt J., (eds.) On Conceptual Modeling. Springer Verlag, 1984. 

6. M.P. Cline, “Using Design Patterns to Develop Reusable Object-Oriented Communication
Software”, Communications of the ACM, 38 (10), October 1995, pp. 65-74.

7. Discenza A., Garzotto F., “Design Patterns for Museum Web Sites”. In Proceedings
MW’99 - 3rd International Conference on Museums and the Web, New Orleans,
USA, March 1999, pp. 144-153.

8. M. Fowler, Analysis Patterns. Reusable Object Models, Addison-Wesley, 1997.

9. E. Gamma, R. Helm, R. Johnson and J. Vlissides. Design Patterns. Elements of
Reusable Object-Oriented Software. Addison-Wesley, 1996.

10. F. Garzotto, P. Paolini and D. Schwabe. “HDM - A Model Based Approach to
Hypermedia Application Design”. ACM Transactions on Information Systems, 11
(1), 1993, pp. 1-26.

® We remind the reader that a design pattern should not be an “intuition” but should 
synthesize some consolidated experience. 

® The interested reader, possibly willing to submit a pattern, is invited to contact
davide.bolchini@lu.unisi.ch or sara.valenti@hoc.elet.polimi.it for further
information.







11. Garzotto F., L. Mainetti, P. Paolini “Adding Multimedia Collections to the Dexter 
Model”. In Proc. ACM ECHT’94, Edinburgh (UK), September 1994, pp. 70-80. 

12. Garzotto F., Mainetti L., Paolini P. “Hypermedia Design, Analysis, and Evaluation 
Issues”, Communications of the ACM, 38 (8), August 1995, pp. 74-87. 

13. F. Garzotto, L. Mainetti and P. Paolini. “Information Reuse in Hypermedia Ap- 
plications”. In Proc. of the ACM International Conference on Hypertext ’96, ACM 
Press, 1996, pp. 43-54. 

14. Garzotto F., Matera M. “A Systematic Method for Hypermedia Usability Inspec- 
tion”. In The New Review of Hypermedia and Multimedia, 3 (1), January 1997, pp. 
39-65. 

15. F. Garzotto, M. Matera, P. Paolini “To Use or not to Use? Evaluating Usability
of Museum Web Sites”. In Proc. MW’98 - 2nd International Conference
on Museums and the Web, Washington DC, May 1998 - available at
www.archimuse.com/mw98/.

16. Garzotto F., Matera M., Paolini P. “A Framework for Hypermedia Design and Usability
Evaluation.” In Proc. of IFIP-DEUMS’98 - International Working Conference
on Designing Effective and Usable Multimedia Systems, Stuttgart, Germany,
September 1998, pp. 14-28.

17. F. Garzotto, P. Paolini. “Design Patterns for WWW Hypermedia: Problems and
Proposals.” In Electronic Proceedings of the ACM HT99 Workshop on Hypermedia
Development: Design Patterns in Hypermedia - available at
ise.ee.uts.edu.au/hypdev/ht99w/.

18. T. Isakowitz, E. Stohr, P. Balasubramanian “RMM: A Methodology for Structured
Hypermedia Design”, Communications of the ACM, 38 (8), August 1995. 

19. C. Larman, Applying UML and Patterns. An Introduction to Object-Oriented Anal- 
ysis and Design, Prentice Hall PTR, 1998. 

20. G. Mecca, P. Atzeni, A. Masci, P. Merialdo, G. Sindoni. “The Araneus Web-Base
Management System”. In Proc. SIGMOD’98, 1998, pp. 544-546.

21. G. Meszaros, J. Doble, “A Pattern Language for Pattern Writing”, available at
www.mit.edu/~jtidwell/interaction_patterns.html.

22. M. Nanard, J. Nanard and P. Kahn. “Pushing Reuse in Hypermedia Design: Golden
Rules, Design Patterns and Constructive Templates”. In Proc. of the ACM International
Conference on Hypertext ’98, ACM Press, 1998, pp. 11-20.

23. Nielsen J. Usability Engineering, Academic Press, New York, 1993. 

24. P. Paolini, P. Fraternali “A Conceptual Model and a Tool Environment for Devel- 
oping More Scalable, Dynamic, and Customizable Web Applications.” In Proc. of 
EDBT’98 Conference, Valencia, Spain, 1998, pp. 421-435. 

25. G. Rossi, D. Schwabe and A. Garrido. “Design Reuse in Hypermedia Applications
Development”. In Proc. of the ACM International Conference on Hypertext ’97,
ACM Press, 1997, pp. 57-66.

26. G. Rossi, D. Schwabe and F. Lyardet, “Improving Web Information Systems with
Design Patterns”. In Proc. of the 8th International World Wide Web Conference,
Toronto (CA), May 1999, Elsevier Science, 1999, pp. 589-600.

27. D.C. Schmidt, R. E. Johnson and M. Fayad. “Software Patterns”. Communications 
of the ACM, Special Issue on Patterns and Pattern Languages, Vol. 39, No. 10, 
October 1996, pp. 37-39. 

28. D. Schwabe and G. Rossi. “An Object Oriented Approach to Web-Based Applica- 
tion Design”. Theory and Practice of Object Systems, 4 (4), J. Wiley, 1998. 




A Unified Framework for Wrapping, Mediating 
and Restructuring Information from the Web 



Wolfgang May^, Rainer Himmeröder^, Georg Lausen^, Bertram Ludäscher^

Abstract. The goal of information extraction from the Web is to pro- 
vide an integrated view on heterogeneous information sources via a com- 
mon data model and query language. A main problem with current ap- 
proaches is that they rely on very different formalisms and tools for wrap- 
pers and mediators, thus leading to an “impedance mismatch” between 
the wrapper and mediator level. In contrast, our approach integrates 
wrapping and mediation in a unified framework based on an object- 
oriented data model which represents both the Web structure and the 
data of the application domain. Wrappers and mediators are written in 
a rule-based object-oriented language which is augmented with features 
for Web access and structured document analysis, i.e., pattern match- 
ing by regular expressions and SGML parsing. In this paper, we develop 
generic, reusable rule patterns for typical extraction, integration, and 
restructuring tasks using this framework. We show the practicability of 
our approach by using the Florid system [10]. 



1 Introduction 

The Web is now the most popular information repository and there is a strong 
need for integration of data from different sources. Since Web sites can be seen 
as autonomous databases, there are many exceptions and pitfalls when wrap- 
ping such sources. Additionally, data representation can change without notice, 
and new sources can be added. Thus, a combination of rapid prototyping and 
refinement is highly desirable for Web data integration. Most approaches com- 
prise wrappers for translating data from different local languages into a common 
format, and mediators for providing an integrated view on the data. Existing 
approaches have a strictly layered architecture and employ different languages 
for wrappers and mediators. 

In [14], we showed that languages supporting deduction and object-orien- 
tation are particularly suited in this context and presented a formal model for 
querying structure and contents of Web data. Object-orientation provides a flexi- 
ble and rich data model, e.g., for handling partial information. In our framework, 
a unified model of the Web representation and the application-level semantic rep- 
resentation is generated [16], yielding an integrated language for wrapper and 
mediator rules and for querying. This avoids the impedance mismatch due to 
separate languages. By deductive rules, the process of wrapping, mediating, and 
restructuring can be defined in a high-level, declarative programming style. 

^ Institut für Informatik, Universität Freiburg, Germany

In the present paper, we focus on the practical aspects by using the Florid
system [10], an implementation of the deductive object-oriented database lan- 
guage F-Logic [12]. We present a systematic methodology for retrieving and inte- 
grating data from the Web by combining generic rule patterns with application- 
specific rules. The power and flexibility of generic rules can be fully exploited by 
the language which allows variables ranging also over objects and methods, and 
path expressions for navigating in the unified object-oriented model of both the 
Web fragment and the application-level representation. 

The paper is structured as follows: We present the formal framework under- 
lying our approach, including its Web model in Section 2. A methodology for 
extracting data from the Web by generic rule patterns for wrapping and medi- 
ating is proposed in Section 3. In Section 4, we show the practicability of our 
approach by combining generic rule patterns and application-specific rules for 
generating a database from several Web pages. Finally, we summarize our work 
and compare it to other approaches. 

2 A Formal Framework for Web Access 

By combining the rich modeling capabilities (objects, methods, class hierarchy, 
inheritance, signatures) of the object-oriented data model with the advantages of 
deductive database languages, F-Logic provides a suitable framework for mod- 
eling and handling Web information. The approach makes extensive use of the 
following facilities which are unique features of our integration approach:

— The power and flexibility of generic rule patterns depends on the usage of 
variables ranging over objects (including host objects), classes, and methods. 

— Integration of information via object fusion. 

At a short glance, the syntax is as follows (we omit inheritable methods and 
signature specifications; for the full syntax and semantics, the reader is referred 
to [12]): 

— The language is based on variables, constants, and object constructors from
which id-terms are composed as usual. Id-terms are interpreted as elements of
the universe. By convention, object constructors start with lowercase letters
whereas variables start with uppercase ones. Ground id-terms play the role
of logical object identifiers (oids).

In the sequel, let O, C, D, Q_i, S, S_i, Sc, and Mv stand for id-terms.

— An is-a assertion is an expression of the form O : C (object O is a member
of class C), or C :: D (class C is a subclass of class D).

— The following are object atoms:

• O[Sc@(Q_1,...,Q_k) -> S]: applying the single-valued method Sc with
arguments Q_1,...,Q_k to O results in S,

• O[Mv@(Q_1,...,Q_k) ->> {S_1,...,S_n}]: applying the multi-valued method Mv
with arguments Q_1,...,Q_k to O results in some S_i.

— A rule is a logic rule h ← b over F-Logic’s atoms, i.e., is-a assertions and
object atoms; a program is a set of rules.
Example 1. Below, a fragment of the database which is created by the application
described in this paper is shown. For readability, we use mnemonic oids of
the form o_name.

o_belg : country[name->"Belgium"; car_code->"B"; capital->o_brussels;
    total_area->30510; population->10170241; continent@(o_eur)->100;
    indep@(date)->"04 10 1830"; pop_growth->#0.33; gdp_total->197000;
    adm_divs->>{o_p_antwerp, o_p_westfl, ...}; main_cities->>{o_brussels, o_antwerp, ...};
    ethnicgroups@("Fleming")->55; ethnicgroups@("Walloon")->33;
    religions@("Roman Catholic")->75; religions@("Protestant")->25;
    borders@(o_france)->620; borders@(o_germany)->167;
    borders@(o_luxembourg)->148; borders@(o_netherlands)->450].

o_brussels : city[name->"Brussels"; country->o_belg; province->o_p_westfl;
    longitude->#4.35; latitude->#50.8; population@(95)->951580].

o_antwerp : city[name->"Antwerp"; country->o_belg; province->o_p_antwerp;
    population@(95)->459072; longitude->#4.23; latitude->#51.1].

o_p_antwerp : prov[name->"Antwerp"; country->o_belg; capital->o_antwerp;
    area->2867; population->1610695].

o_p_westfl : prov[name->"West Flanders"; country->o_belg; capital->o_brussels;
    area->3358; population->2253794].

o_eu : org[abbrev->"EU"; name->"European Union"; establ@(date)->"07 02 1992";
    seat->o_brussels; members@("member")->>{o_belg, o_france, ...};
    members@("membership applicant")->>{o_hungary, o_slovakia, ...}].

In addition to the basic F-Logic syntax, the Florid system also supports path 
expressions in place of id-terms for navigating in the object-oriented model: 

— The path expression O.M is single-valued and refers to the unique object S
for which O[M -> S] holds, whereas O..M is multi-valued and refers to every
S_i such that O[M ->> {S_i}] holds.

Example 2 (Path Expressions). In our example,

o_belg.capital = o_brussels, o_eu.seat.province = o_p_westfl.

The following query yields all names N of cities in Belgium:

?- C:country[name->"Belgium"], C..main_cities[name->N].

Since path expressions and F-Logic atoms may be arbitrarily nested, a concise 
and extremely flexible specification language for object properties is obtained. 
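
The difference between the single-valued path O.M and the multi-valued path O..M can be illustrated outside Florid. The following is a minimal Python analogue, not Florid code: objects are plain dictionaries, the helper names `dot`, `dotdot`, and `belgian_city_names` are hypothetical, and class membership is not modeled.

```python
# Illustrative analogue only: objects are plain dicts, not Florid objects.
db = {
    "o_belg": {"name": "Belgium", "capital": "o_brussels",
               "main_cities": ["o_brussels", "o_antwerp"]},
    "o_brussels": {"name": "Brussels", "province": "o_p_westfl"},
    "o_antwerp": {"name": "Antwerp", "province": "o_p_antwerp"},
    "o_eu": {"abbrev": "EU", "seat": "o_brussels"},
}


def dot(oid, method):
    """Single-valued path O.M: the unique S with O[M -> S], or None."""
    value = db.get(oid, {}).get(method)
    return None if isinstance(value, list) else value


def dotdot(oid, method):
    """Multi-valued path O..M: every S_i with O[M ->> {S_i}]."""
    value = db.get(oid, {}).get(method, [])
    return list(value) if isinstance(value, list) else []


def belgian_city_names():
    """Rough analogue of: ?- C:country[name->"Belgium"], C..main_cities[name->N]."""
    return [dot(city, "name")
            for oid, obj in db.items() if obj.get("name") == "Belgium"
            for city in dotdot(oid, "main_cities")]
```

Chaining `dot(dot("o_eu", "seat"), "province")` mirrors the nested path expression o_eu.seat.province from Example 2.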

Object Creation. Single-valued references can create anonymous objects when 
used in the head of rules. A rule of the form 
O.M[<properties>] ← <body(O,M)>

creates an object x such that o[m->x] and x[<properties>] hold whenever <body(o,m)>
is satisfied.

Negation-free F-Logic programs have a standard logic programming semantics 
[12]. For programs with negation, Florid allows inflationary and user-defined 
stratified semantics. The Florid system^ provides an implementation of F-Logic
extended with the Web model described below. 

The Web Model. The underlying object-oriented Web model [14] is based on the
concepts of url and webdoc, which represent URLs and Web documents, respectively.
Every member u of class url provides a special method u.get which is
implemented as an active method: When u.get occurs in the head of a rule, the
Web document which is accessible via u is accessed, assigned to the newly created
object u.get, and made an instance of class webdoc. Typically, the document
associated with a url contains hyperlinks of the form “<a href = u'> ℓ </a>” to
other URLs which refer to further Web documents. The hyperlink structure (Web
skeleton) is modeled by the built-in property webdoc.hrefs:

u.get[hrefs@(ℓ) ->> {u'}] ⇔ u.get contains "<a href = u'> ℓ </a>".
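
To make the Web skeleton concrete, the following minimal Python sketch collects, for one HTML document, the targets of all hyperlinks grouped by their label, which is roughly the information held in hrefs@(ℓ). It only uses the standard library `html.parser` module; the names `HrefCollector` and `web_skeleton` are hypothetical, and the sketch is not how Florid implements the property.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the targets of every <a href=...>label</a>, keyed by label."""

    def __init__(self):
        super().__init__()
        self.hrefs = {}       # label -> set of targets, like hrefs@(label) ->> {u'}
        self._target = None   # href of the anchor currently being read
        self._label = []      # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._target = dict(attrs).get("href")
            self._label = []

    def handle_data(self, data):
        if self._target is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            label = "".join(self._label).strip()
            self.hrefs.setdefault(label, set()).add(self._target)
            self._target = None


def web_skeleton(html_text):
    """Return a dict mapping each link label to the set of link targets."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs
```

Fetching the document itself (the role of the active method u.get) is deliberately left out, so the sketch runs without network access.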

By u.get, “raw” (i.e., uninterpreted) data from the Web is addressed. This handling
reflects only the very basic properties of Web documents, without further
exploiting the document type or structure. The contents of the retrieved Web
pages have to be analyzed by wrappers, using the internal structure of HTML or
XML Web documents. The active method u.parse generates the F-Logic representation
of the parse-tree of an SGML document and assigns it to the object
u.parse : parsetree. Together with the Web skeleton, the parse-trees provide a
comprehensive model of the relevant Web fragment [16]. Via path expressions,
the user can then navigate through the extended Web skeleton, i.e., the link and
parse-tree structure of a Web fragment. To our knowledge, Florid is the only
system which is aware of the Web structure itself in its data model. Our approach
implements a hybrid concept by embedding data-driven wrapping into a
warehouse approach: An F-Logic database is generated by iteratively accessing
Web documents, analyzing them, traversing relevant hyperlinks, etc.

3 Wrapping and Mediation by Generic Rule Patterns 

In our approach, the construction of wrappers and mediators is based on generic 
rule schemata for standard tasks which can be complemented by application- 
specific rules and refinements for handling exceptional cases to obtain the re- 
quired flexibility. For the wrapping task, depending on the structure of Web 
pages, parser-based, matching-based, or combined wrapper rules are preferable:

Structural Markup (tagged data): For logical markup such as lists or tables
which is subject to a specific HTML “subgrammar”, wrapping is based
on u.parse, which is an object-oriented representation of the parse-tree. For
each SGML tag t, the class u.parse.t contains all structures of type t on u: e.g.,
x:u.parse.ul holds for all <ul>-lists x on the Web document. Additionally, every
tag induces a method for navigation in a parse-tree; e.g., for x:u.parse.ul,
the expression x.ul@(n) yields the n-th element of the unordered list x.

We define generic patterns for decomposing lists into their items, and for
analyzing tables by identifying column headers and table entries. The interpretation
of the contents is then done by application-specific rules. The usage
of parse-trees is illustrated in Sections 4.2 and 4.3.

^ available at http://www.informatik.uni-freiburg.de/~dbis/florid/.

Optical and Syntactical Markup (untagged data): If the structure is only 
partially tagged (paragraphs, line breaks, fonts, commas), the wrapper is 
based on pattern matching via regular expressions. Generic patterns are used, 
e.g., for decomposing “pseudo-lists” which are structured via emphasizing or 
bold face fonts, for commalists, and for name-value pairs (cf. Sec. 4.1 and 4.2).
Here, matching with regular expressions is provided by the built-in predicate
pmatch(<string>, "/<regexp>/", [<fmt-list>], [X1,...,Xn])
which extracts substrings by Perl’s regular expressions into Perl-variables and
binds them to F-Logic variables X1,...,Xn as specified by <fmt-list>.
Additionally, arbitrary predicates (e.g., for string processing) can be defined 
via Florid’s Perl interface: the predicate perl(<perl-routine-as-string>, X,Y) 
binds Y to the result(s) of calling <perl-routine-as-string> with X as argument. 

Often, logical markup and textual formatting are mixed, leading to combined 
generic rules (cf. Section 4.2). 
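
The behavior of pmatch can be approximated in a few lines of Python. The sketch below is an analogue under stated assumptions, not Florid's implementation: it uses Python's `re` module instead of Perl regular expressions, it treats each format entry such as "$1" as a template over the match groups, and it enumerates every match in the string (an assumption suggested by the way the comma-list rules of Section 4.1 bind one variable to several values).

```python
import re


def pmatch(text, regexp, fmt_list):
    """Yield one tuple per match of regexp in text.

    Each entry of fmt_list is a template such as "$1" or "$2 ($1)"; every
    "$n" is replaced by the n-th group of the current match.
    """
    pattern = re.compile(regexp)
    for match in pattern.finditer(text):
        yield tuple(
            re.sub(r"\$(\d+)",
                   lambda ref: match.group(int(ref.group(1))) or "",
                   fmt)
            for fmt in fmt_list
        )
```

For instance, `list(pmatch("France 620 km, Germany 167 km", r"([A-Za-z]+) (\d+) km", ["$1", "$2"]))` produces one (name, value) tuple per list element, which corresponds to binding two F-Logic variables per match.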

In contrast to wrapping, the mediation and integration task is usually dominated
by application-specific rules. However, there are still recurring
situations where generic rule patterns apply: Objects which result from different
sources have to be merged. Similarly, entities can be represented by string-valued 
attributes in one source, and by objects in another; then, integration will derive 
a link to the corresponding object. 

In general, these generic rule patterns are not sufficient for wrapping and 
mediating information from Web pages - due to exceptions in the representation 
and to the complexity inherent to the application semantics. The “skeleton” of 
the program is assembled from generic rule schemata for the above standard 
tasks. Then, in a rapid-prototyping process, the program is completed manually: 

— Structural and syntactical exceptions are handled by refinements of generic 
rule patterns, 

— application-specific rules interpret the outcome of the generic patterns in a 
model of the application domain. 

The rule-based approach allows for modular design of wrappers and mediators. 

4 Patterns in Practice in Florid 

We illustrate our methodology by an excerpt of a larger case study which creates 
a geographical database [15]. The main data sources are the CIA World Factbook 
[4], which provides information about countries and political organizations, and 
Global Statistics [7], a collection of data about administrative divisions and cities. 
On the wrapper-level, an individual model for each of the data sources is obtained 
by instantiating generic patterns for wrapping Web documents. The mediation 
and integration part consists of fusing corresponding objects across sources and 
creating references between the partial models. Unlike other systems, Florid
performs all tasks (data access, extraction, integration, and restructuring) in a
single unified framework and thus overcomes limitations of the strictly layered
approach. For instance, regular expressions for wrapping can be derived at runtime
based on previously extracted data from other sources. To our knowledge,
Florid is the only system which allows such a “reflective” programming model.
In the following, all instances of generic extraction patterns are labeled with
[RULE-NAME].

4.1 CIA World Factbook: Wrapping Country Pages 

The CIA World Factbook provides political, economic, social, and some geo- 
graphical information. The pages can be classified as formatted text (see Fig. 1), 
thus, the wrapper is based on pattern matching via regular expressions. The 
regular textual format of the pages allows for very generic and easily extendible 
patterns. This holds for single-valued properties (both numeric and string, such 
as capital and population) and for multi-valued properties like border countries, 
languages, etc., which are given by lists of pairs (name, value). 



Belgium

Geography
    Western Europe, bordering the North Sea, between France and the Netherlands
    Area: total area: 30,510 sq km
    border countries: France 620 km, Germany 167 km, Luxembourg 148 km, Netherlands 450 km
People
    Population: 10,170,241 (July 1996 est.)
    Ethnic divisions: Fleming 55%, Walloon 33%, mixed or other 12%
    Religions: Roman Catholic 75%, Protestant or other 25%
    Languages: Dutch 56%, French 32%, German 1%, legally bilingual 11% (divided along ethnic lines)
Government
    Capital: Brussels
    Independence: 4 October 1830 (from the Netherlands)
Economy
    GDP: purchasing power parity - $197 billion (1995 est.)

Fig. 1. Excerpt of the CIA World Factbook: Countries

First, we access the continent pages europe:continent[url@(cia)->...] using
the built-in active method url.get:

U:url.get :- C:continent[url@(cia)->U].

Every continent page contains a list of links to the countries; these are automatically
stored in the slots u.get.hrefs@(<label>). The country objects are defined
with their urls, names, and continents, and the urls are accessed:

cid(cia,C2):country[url@(cia)->U; name@(cia)->>CountryName; continent->Cont] :-
    Cont:continent.url@(cia).get[hrefs@(Label)->>U],
    pmatch(Label, "/(.*) \([0-9]/", "$1", CountryName).

U:url.get :- C:country[url@(cia)->U].

E.g., for Belgium, this rule derives

o_belg_CIA : country[url@(cia)->"..."; name@(cia)->>"Belgium"; continent->o_eur].

• Extracting scalar properties by patterns. On the CIA country pages, 
properties of countries are introduced by keywords. In a first step, several single- 
valued methods are defined using a generic pattern which consists of the method 
name to be defined, the regular expression which contains the keyword for ex- 
tracting the answer, and the datatype of the answer: 

p(country, capital, "/Capital:.*\n(.*)/", string).
p(country, indep, "/Independence:.*\n(.*) [^(]/", date).
p(country, total_area, "/total area:.*\n(.*?) sq km/", num).
p(country, population, "/Population:.*\n([0-9,]+) /", num).

The patterns are applied to the country.url.get objects, representing the Web
pages. Note the use of variables ranging over host objects, methods, and classes:

Ctry[Method:Type -> X] :-                          [Text Patterns + Refinement]
    p(country, Method, RegEx, Type),
    pmatch(Ctry:country.url@(cia).get, RegEx, "$1", X),
    not substr("million", X).

Ctry[Method:Type -> X2] :- p(country, Method, RegEx, Type),
    pmatch(Ctry:country.url@(cia).get, RegEx, "$1", X),
    pmatch(X, "/(.*) million/", "$1", X1), X2 = X1 * 1000000.

Here, we see an instance of refinement: the (generic) main rule is only valid
if it applies to a string or to a number given by digits - but some numbers are
given “semi-textually”, e.g., “42 million”. This is covered by the refined rule.
Here, e.g.,

o_belg_CIA[capital->"Brussels"; indep->"4 October 1830";
    total_area->30510; population->10170241]

is derived. Similarly, dates are extracted from strings by further matching:

C[M@(date)->D2] :- C:country[M:date->S], perl(fmt_date, S, D),        [Date]
    pmatch(D, "/\A([0-9]{2} [0-9]{2} [0-9]{4})/", "$1", D2).          (*)

yielding, e.g., o_belg_CIA[indep@(date)->"04 10 1830"].
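
The interplay of the generic main rule and its refinement for “semi-textual” numbers can be sketched in Python as follows. This is an illustrative analogue only: the function name `extract_scalar` is hypothetical, Python's `re` replaces Perl's regular expressions, and the removal of thousands separators is an assumption about how values such as “30,510” become numbers.

```python
import re


def extract_scalar(page, regexp, datatype):
    """Return the first value captured by regexp, converted by datatype.

    datatype is "num" or anything else (treated as a string type).
    Returns None when the keyword pattern does not occur on the page.
    """
    match = re.search(regexp, page)
    if match is None:
        return None
    raw = match.group(1).strip()
    if datatype != "num":
        return raw
    # Refinement: "semi-textual" numbers such as "42 million".
    million = re.fullmatch(r"([0-9.,]+) million", raw)
    if million:
        return int(float(million.group(1).replace(",", "")) * 1_000_000)
    # Main rule: a number given by digits (assumed thousands separators removed).
    return int(raw.replace(",", ""))
```

As in the rules above, the keyword, the regular expression, and the datatype are data, so new properties can be supported by adding a pattern rather than a new function.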

• Extracting multi-valued complex properties by patterns. Ethnic groups,
religions, languages, and border countries are given by comma-separated lists of
(string, value, unit) triples, introduced by a keyword; unit is, e.g., “%” or “km”:

commalist(ethnicgroups, "/Ethnic divisions:.*\n(.*)/", "\%", string).
commalist(languages, "/Languages:.*\n(.*)/", "\%", string).
commalist(religions, "/Religions:.*\n(.*)/", "\%", string).
commalist(borders, "/border countr[a-z]*:.*\n(.*)/", "km", country).

For every pattern, the raw commalist is extracted, e.g.,

o_belg_CIA[borders:commalist->"France 620 km, Germany 167 km, ..."].
Comments (between parentheses) are dropped by a call to the user-defined Perl-
predicate dropcomments:

C[M:commalist->K] :- commalist(M,S,_,_),                          [Commalists]
    pmatch(C:country.url@(cia).get, S, "$1", L), perl(dropcomments, L, K).

The comma-lists are decomposed into multi-valued methods (since comments
are dropped, e.g., Muslim(Sunni) and Muslim(Shi’a) both result in Muslim):

C[M->>X] :- C:country[M:commalist->S],                [Decomposing Commalists]
    pmatch(S, "/([^,]*)/", "$1", X).                                     (**)

yielding o_belg_CIA[borders->>{"France 620 km", "Germany 167 km", ...}].

In the next step, the values are extracted from the individual list elements.
There is a refined rule which preliminarily assigns the null value to elements
which contain no value (which is often the case for languages or religions):

C[M@(X)->>V] :- commalist(M,_,S,_),                        [Decomposing Pairs]
    strcat("/([^0-9]*)\s*([0-9\.]+) ", S, T), strcat(T, "/", Pattern),
    pmatch(C:country..M, Pattern, ["$1","$2"], [X,V]).
C[M@(X)->>null] :-
    commalist(M,_,S,_), C:country[M->>X], not substr(S,X).

yielding o_belg_CIA[borders@("France")->>620; borders@("Germany")->>167; ...].
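
The three steps just described (dropping comments, splitting the comma list, and separating each element into a name and a value) can be sketched in Python. This is an analogue under assumptions, not the Florid rules themselves: the function names are hypothetical, a missing value is represented by `None` in place of null, and values are converted to floats.

```python
import re


def drop_comments(text):
    """Remove parenthesised comments, e.g. "Muslim (Sunni) 80%" -> "Muslim 80%"."""
    return re.sub(r"\s*\([^)]*\)", "", text)


def decompose(commalist, unit):
    """Split a comma list into (name, value) pairs; value is None if absent."""
    pair = re.compile(r"([^0-9]*?)\s*([0-9.]+)\s*" + re.escape(unit))
    result = []
    for item in drop_comments(commalist).split(","):
        item = item.strip()
        if not item:
            continue
        match = pair.fullmatch(item)
        if match:
            result.append((match.group(1).strip(), float(match.group(2))))
        else:
            # Element without a value for this unit: preliminary null.
            result.append((item, None))
    return result
```

As in the generic rule, the unit is a parameter of the pattern, so the same function covers border lengths (“km”) and percentages (“%”).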
The multi-valued methods are summed up using Florid’s sum aggregate:

C[M@(X)->V] :- commalist(M,_,S,string),                            [Summation]
    V = sum{W [C,M,X]; C:country[M@(X)->>W]}, V > 0.

In a refinement step, non-quantified elements of unary lists are now set to 100%,
other entries without percentages remain nulls; here, a count aggregate is used.

C[M@(X)->#100] :- commalist(M,_,S,string),   [Refinement: Exceptions of Summation]
    C:country..M@(X)[], not C.M@(X)[],
    N = count{X [C,M]; C:country..M@(X)[]}, N = 1.
C[M@(X)->null] :- commalist(M,_,S,string),
    C:country[M@(X)->>null], not C.M@(X)[],
    N = count{X [C,M]; C:country..M@(X)[]}, N > 1.

For country[borders@(country)->number], the country objects are used as arguments
(M and Class are instantiated with borders and country, respectively).

C[M@(C2)->V] :- C2:Class[name@(cia)->>X], commalist(M,_,S,Class),
    V = sum{Val [C,X]; C:Class[M@(X)->>Val]}, V > 0.

resulting in o_belg_CIA[borders@(o_france)->620; borders@(o_germany)->167; ...].
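
The summation and its refinement can be summarized by a short Python sketch. It assumes, purely for illustration, that one property of one country has already been decomposed into (name, value) pairs in which a missing value is `None`; the function name `summarize` is hypothetical, and the sketch paraphrases the rules rather than reproducing Florid's aggregate semantics exactly.

```python
from collections import defaultdict


def summarize(pairs):
    """Sum the values per name and refine entries that have no value.

    pairs: list of (name, value) with value a number or None.
    Returns a dict name -> total, 100.0 for the single unquantified element
    of a unary list, or None for other unquantified elements.
    """
    totals = defaultdict(float)
    for name, value in pairs:
        if value is not None:
            totals[name] += value
    # Summation rule: keep only positive sums.
    result = {name: total for name, total in totals.items() if total > 0}
    all_names = {name for name, _ in pairs}
    for name, value in pairs:
        if value is None and name not in result:
            # Refinement: a unary list means 100%; otherwise the entry stays null.
            result[name] = 100.0 if len(all_names) == 1 else None
    return result
```

Counting the distinct names of the list plays the role of the count aggregate: only when exactly one name occurs is the missing percentage read as 100.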

4.2 The CIA World Factbook: Wrapping Organizations 

Appendix C of the CIA World Factbook provides information about political 
and economical organizations. The pages are tagged nested lists (see Fig. 2). 
Thus, we exploit the structure of the parse-tree: For every organization, a list of 
properties is given. The list items are then queried by pattern matching, filling
the respective methods of the organization. The members are again given as a 
set of comma-lists (distinguished by the type of membership) of countries. 

• European Union (EU)
    ◦ address-c/o European Commission, ..., B-1049 Brussels, Belgium
    ◦ established-7 February 1992
    ◦ members-(15) Austria, Belgium, Denmark, ...
    ◦ membership applicants-(12) Albania, Bulgaria, ...

<UL> ... <LI>European Union (EU)</LI>
<UL>
<LI><I>address</I>-c/o ... , B-1049 Brussels, Belgium</LI>
<LI><I>established</I>-7 February 1992</LI>
<LI><I>members</I>-(15) Austria, Belgium, Denmark, ... </LI>
<LI><I>membership applicants</I>-(12) Albania, Bulgaria, ... </LI>
</UL> ... </UL>

Fig. 2. Excerpt of the CIA World Factbook: Appendix C: Organizations

U:url.parse :- ciaAppC[url->U].

For every letter, there is a list with organizations. The class u.parse.ul contains
all lists on u - note that only upper-level lists qualify:

L:org_list :- L:(ciaAppC.parse.ul), L = ciaAppC.parse.root.html@(X).

The org_lists (<UL>) are then analyzed by navigation in the parse-tree: The k-th
element of the <UL> is an organization header, the (k+1)-th the corresponding
data list. In the same step, the abbreviation and the full name are extracted
from the header; the inner <UL> contains the properties of the organization:

org(Short):org[abbrev->Short; name->Long; data->Data] :-
    L:org_list[ul@(K)->Hdr], L.ul@(K1)[li@(0)->Data], K1 = K + 1,
    Hdr[li@(0)->Name], string(Name), Data.ul@(0)[],
    pmatch(Name, "/(.*) \(([^)]*)\)/", ["$1","$2"], [Long,Short]).

derives, e.g.,

o_eu:org[abbrev->"EU"; name->"European Union";
    data->"<UL><LI><I>address</I> ... </LI>
           <LI><I>membership applicants</I>-(12) Albania, ... </LI>
           </UL>"].

For every organization, several properties are listed, e.g., its address, its
establishment date, and lists of members of the different member types. For every
property, the corresponding r-th item of the (inner) <UL> (which itself is an
<LI>) is identified by a keyword (which is <I>-ed in the 0-th item of the <LI>),
e.g., "address" or "established", or a member type.

    p(org, address, "address", string).
    p(org, establ, "established", date).
    p(org, member_names, "member", membertype).
    p(org, applicant_names, "membership applicants", membertype).
    membertype :: commalist.

Note that the rule is refined to skip the leading "-" and the number of members:

    O[MethName:Type->Value2] :-                                      List of Properties
        O:org[data->D], p(org,MethName,MethString,Type),             Refined
        D.ul@(X)[li@(0)->_[i@(0)->MethString]; li@(1)->Value],
        pmatch(Value, "/\A-\s*(\([0-9]*\))*\s*(.*)/", "$2", Value2).

derives, e.g.,

    o_eu[address:string->"c/o ..., Brussels, Belgium"; establ:date->"7 February 1992";
         member_names:membertype->"Austria, Belgium, ...";
         applicant_names:membertype->"Albania, Bulgaria ..."].
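
To see what the refined pattern keeps and what it skips, here is a small Python sketch under the assumption that the raw value is the text following the <I>-ed keyword. It reuses the regular expression of the rule; the helper name clean_value is hypothetical.

```python
import re

# Pattern of the refined rule: a leading "-", an optional member count
# such as "(15)", optional blanks, then the value proper (group 2).
VALUE = re.compile(r"\A-\s*(\([0-9]*\))*\s*(.*)")


def clean_value(raw):
    """Strip the leading '-' and an optional '(n)' count from a raw property value."""
    match = VALUE.match(raw)
    return match.group(2) if match else None


if __name__ == "__main__":
    for raw in ("-7 February 1992", "-(15) Austria, Belgium, Denmark"):
        print(repr(clean_value(raw)))
```

Values that do not start with "-" are rejected rather than passed through, mirroring the fact that the rule body would not match for them.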

Again, by an instance of the generic rule pattern (*), the date is formatted:

    O[M@(date)->D2] :- O:org[M:date->S], perl(fmt_date,S,D),         Date
        pmatch(D, "/\A([0-9]{2} [0-9]{2} [0-9]{4})/", "$1", D2).      (*)

The address is analyzed by an application-specific matching rule:

    O[seatcity->Ct; seatcountry->Co] :- O:org[address->A],
        pmatch(A, "/(.*)[,0-9] ([^,0-9]*)\Z/", ["$1","$2"], [Ct,Co]).

derives, e.g.,

    o_eu[seatcity->"Brussels"; seatcountry->"Belgium"; establ@(date)->"07 02 1992"].
The commalists of the membertypes are again decomposed using the generic
rule pattern (**) (remember that membertype :: commalist):

    O[M->>X] :- O:org[M:commalist->S],                               Decomposing Commalists
        pmatch(S, "/([^,]*)/", "$1", X).                              (**)

derives, e.g.,

    o_eu[member_names:membertype->>{"Austria", "Belgium", ...};
         applicant_names:membertype->>{"Albania", "Bulgaria", ...}].
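
The effect of pattern (**) can be illustrated outside F-Logic. The sketch below is a plain Python analogue, assuming that surrounding blanks are not part of an item; the function name split_commalist is chosen for illustration only.

```python
import re


def split_commalist(text):
    """Decompose a comma-separated string into the set of its items."""
    # "[^,]+" matches each maximal run of non-comma characters;
    # strip() removes the blank that follows each comma in the source text.
    items = [part.strip() for part in re.findall(r"[^,]+", text)]
    return {item for item in items if item}


if __name__ == "__main__":
    print(sorted(split_commalist("Austria, Belgium, Denmark")))
```

A set is returned because the multivalued method in the rule head also collects its results without order or duplicates.
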
In a first integration step, the memberships are represented by references to the
respective country objects:

    O[member@(Tname)->>C] :- p(org,T,Tname,membertype),
        O[T:membertype->>CN], C:country[name@(cia)->>CN].

derives, e.g., o_eu[member@("member")->>{o_austria_CIA, o_belg_CIA, ...}].

4.3 Global Statistics: Wrapping Cities and Provinces

The Global Statistics [7] pages provide information about administrative
divisions (area, population, and capitals) and main cities (population and
province). The data is given by HTML tables as in the excerpt below, but with
an irregular structure of columns.

    administrative divisions of Belgium
    province   capital    area (km²)   pop. Dec. 31, 1991
    Antwerp    Antwerp    2,867        1,610,695
    Brabant    Brussels   3,358        2,253,794

    main cities of Belgium
    cities                      pop. 1995 est.
    Brussels (Agglomeration)    951,580
    Antwerp                     459,072

Here, generic rules for analyzing tables are applied. Application-specific rules
map the semantics of the columns to the object-oriented model. Additionally, layout
specialties have to be dealt with, such as two-column layout and multilevel
administrative divisions. After working through the continent pages for getting
the country urls, these are accessed and parsed by the built-in SGML parser:

    U:country_url.parse :- _:country[url@(gs)->U].

The class u.parse.table contains all tables of u. Based on the <TR> and <TD>
structure, the tables are mapped to a matrix (by several refinements of the
following rule):

    elem(T,Row,Column)[cont->Cont; type->Type] :-                    Tables:
        U:country_url, T:(U.parse.table),                            Elements
        T.table@(0)[tbody@(Row)->_[tr@(Column)->X[Type@(0)->Cont]]].

Here, e.g., the administrative divisions table is the second element in the page
body, i.e.,

    o_belg_GS.url@(gs).parse.body@(2) : o_belg_GS.url@(gs).parse.table.
    elem(o_belg_GS.url@(gs).parse.body@(2),2,3)[cont->"area (km²)"; type->"th"].
    elem(o_belg_GS.url@(gs).parse.body@(2),4,3)[cont->"3,358"; type->"td"].
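
The mapping of a table to a matrix of typed cells can be sketched with Python's standard html.parser module. This is an illustrative analogue of the elem(T,Row,Column) facts, not the Florid implementation: the class and function names are assumptions, indices are zero-based here (unlike the example above), and nested tables and colspan are deliberately ignored.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect table cells as {(row, col): (content, type)}, type being 'td' or 'th'."""

    def __init__(self):
        super().__init__()
        self.cells = {}
        self.row = -1
        self.col = -1
        self.cell_type = None   # tag of the cell currently being read, if any
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = -1
        elif tag in ("td", "th"):
            self.col += 1
            self.cell_type = tag
            self.text = []

    def handle_data(self, data):
        if self.cell_type is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and tag == self.cell_type:
            content = "".join(self.text).strip()
            self.cells[(self.row, self.col)] = (content, tag)
            self.cell_type = None


def table_matrix(html):
    """Parse an HTML table and return its cells keyed by (row, column)."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return parser.cells


SAMPLE = """<table>
<tr><th>province</th><th>capital</th><th>area (km2)</th><th>pop. Dec. 31, 1991</th></tr>
<tr><td>Antwerp</td><td>Antwerp</td><td>2,867</td><td>1,610,695</td></tr>
<tr><td>Brabant</td><td>Brussels</td><td>3,358</td><td>2,253,794</td></tr>
</table>"""

if __name__ == "__main__":
    for position, cell in sorted(table_matrix(SAMPLE).items()):
        print(position, cell)
```

Keeping the cell type next to the content is what later allows header rows to be told apart from data rows.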

In a first step, information about column headers is extracted:

    T[header_row->>HR] :- U:country_url,                             Tables:
        T:(U.parse.table), elem(T,HR,_)[type->th].                   Column Headers
    T[header@(HR,Col)->>String] :-
        U:country_url, T:(U.parse.table)[header_row->>HR],
        elem(T,HR,Col)[cont->String; type->th].

leading to, e.g.,

    o_belg_GS.url@(gs).parse.body@(2)[header_row->>{1};
        header@(3,1)->>"province"; ...; header@(3,4)->>"pop. Dec. 31, 1991"].
Then, the tables are analyzed by application-specific methods, e.g., the table of
administrative divisions and its columns are identified. For identification of a
column, a set of keywords can be given:

    C[adm_div_tab -> T:admdivtab[name_col->0]] :-
        C:country[url@(gs)->U], T:(U.parse.table), elem(T,_,_)[cont->I],
        pmatch(I, "/administrative divisions of (.*)/", "$1", Name).
    area_col:admdivtabcol[header->>{"area"}].
    pop_col:admdivtabcol[header->>{"population"}].
    cap_col:admdivtabcol[header->>{"capital","hauptstadt"}].

    T[CName->Col] :- CName:admdivtabcol[header->>H],                 Table Columns:
        T:admdivtab[header@(_,Col)->>S], substr(H,S).                Identification

Here, o_belg_GS[adm_div_tab->o_belg_GS.url@(gs).parse.body@(2)], and

    o_belg_GS.adm_div_tab[name_col->1; area_col->3; pop_col->4; cap_col->2].
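
Keyword-based column identification amounts to a substring test of each keyword against each header string. The following Python sketch illustrates this under stated assumptions: the dictionary layout and the function name identify_columns are hypothetical, matching is made case-insensitive, and the keyword "pop." is added here so that the abbreviated header of the sample table is found.

```python
# Keywords per logical column; "pop." is an assumption added for this sketch.
COLUMN_KEYWORDS = {
    "area_col": ["area"],
    "pop_col": ["population", "pop."],
    "cap_col": ["capital", "hauptstadt"],
}


def identify_columns(headers, keywords=COLUMN_KEYWORDS):
    """Map each logical column name to the index of the first matching header.

    headers: dict {column_index: header_text}.
    """
    found = {}
    for name, words in keywords.items():
        for col, text in headers.items():
            if any(word in text.lower() for word in words):
                found[name] = col
                break
    return found


if __name__ == "__main__":
    headers = {1: "province", 2: "capital", 3: "area (km2)", 4: "pop. Dec. 31, 1991"}
    print(identify_columns(headers))
```

Listing several keywords per column, as with "capital" and "hauptstadt", is what lets one rule cover pages written in different languages.
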
Based on the header entries, the data is extracted and transformed into an
object-oriented representation:

    C[adm_divs ->> prov(C,R):prov
          [country->C; name_str->PN; population->P; area->A]] :-
        C:country[adm_div_tab -> T[name_col->NS; area_col->AS; pop_col->PS]],
        elem(T,R,NS)[cont->PN; type->td], elem(T,R,AS)[cont->A; type->td],
        elem(T,R,PS)[cont->P; type->td].

If a capital column exists, it is also evaluated:

    C[adm_divs ->> prov(C,R)[capital_str->CN]] :-
        C:country[adm_div_tab -> T[cap_col->CS]], prov(C,R):prov,
        elem(T,R,CS)[cont->CN; type->td].

deriving, e.g., o_belg_GS[adm_divs->>{o_p_antwerp}] and

    o_p_antwerp[name_str->"Antwerp"; population->1610695;
                area->3358; capital_str->"Antwerp"].

Analogously, the table of main cities is evaluated. Then, the references from cities
to provinces, and the references from provinces to their capitals are generated:

    Cty[province->>Prov] :- Cty:city[country->C; prov_name->PN],
        Prov:prov[country->C; name->PN].
    P[capital->Capcty] :- P:prov[country->C; name->PN; capital_name->CN],
        Capcty:city[country->C; name@(gs)->>CN; province->>P[name->PN]].

4.4 Mediation 

After transforming the contents of the Web pages into F-Logic models by wrap- 
per rules, the integration and restructuring of the data in the object-oriented 
model can be performed. 

The databases generated by the above wrapper rules are integrated into a
common object-oriented schema. Here, the countries of CIA and Global Statistics
have to be matched, either by name

    C1 = C2 :- C1:Class[name@(A)->N], C2:Class[name@(B)->>N].        Fusion

(which would derive o_belg_CIA = o_belg_GS) or (if different names are used) if they
belong to the same continent and the capitals' names are the same:

    C1 = C2 :-
        C1:country[name@(cia)->>_; continent->CT; capital@(cia)->N],
        C2:country[continent@(gs)->>CT], C2.capital[name@(gs)->>N].
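
The two fusion criteria can be illustrated with a small Python sketch. It is only an analogue of the rules above: the record layout (dictionaries with name, continent and capital), the object identifiers, and the function name match_countries are assumptions made for this example.

```python
def match_countries(cia, gs):
    """Return pairs (cia_id, gs_id) that are taken to denote the same country.

    Two records match if their names are equal, or if they lie on the same
    continent and their capitals carry the same name.
    """
    pairs = []
    for cia_id, c in cia.items():
        for gs_id, g in gs.items():
            same_name = c["name"] == g["name"]
            same_capital = (c["continent"] == g["continent"]
                            and c["capital"] == g["capital"])
            if same_name or same_capital:
                pairs.append((cia_id, gs_id))
    return pairs


# Illustrative records only; the identifiers are hypothetical.
CIA = {
    "belg_cia": {"name": "Belgium", "continent": "Europe", "capital": "Brussels"},
    "civ_cia": {"name": "Cote d'Ivoire", "continent": "Africa", "capital": "Yamoussoukro"},
}
GS = {
    "belg_gs": {"name": "Belgium", "continent": "Europe", "capital": "Brussels"},
    "civ_gs": {"name": "Ivory Coast", "continent": "Africa", "capital": "Yamoussoukro"},
}

if __name__ == "__main__":
    print(match_countries(CIA, GS))
```

The second pair is found only through the continent-and-capital criterion, which is the case the second rule is meant to cover.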

For creating intersource links, generic rules using meta knowledge about the
schema (keys, referential constraints) are employed. Here, we give only an
application-specific rule which links seats of organizations (in CIA) to the
respective cities (in Global Statistics):

    O[seat->Cty] :- O:org[seatcity->N; seatcountry->CN],
        Cty:city[name->>N], Cty.country[name@(cia)->>CN].

deriving, e.g., o_eu[seat->o_brussels].

The rules given are slightly simplified w.r.t. the original programs [15] to illustrate
the principles of information extraction from the Web in our approach. In the
full program, more exceptions and formatting issues had to be handled; these lead
to refined rules.

5 Conclusion and Related Work

Our approach for extracting and integrating information from the Web is based 
on (i) a unified object-oriented model of the Web and the application domain 
and (ii) an integrated rule-based language for wrapping and mediation which 
allows reuse of generic rule patterns for different applications. By instantiating 
generic rule patterns for handling logical markup and textual formatting and 
typical situations of data integration, a program skeleton is assembled which 
can be completed by application-specific rules and refining rules. 

Existing approaches for information extraction employ different languages 
for wrappers and mediators. In the Araneus project [3] , different languages are 
used for extracting data and defining views on it in a hypertext-based model. 
Tsimmis [8] uses OEM (Object Exchange Model) as a common data model. Here, 
different languages are used for wrapping (WSL), mediation (MSL) and querying 
(MSL, Lorel). For a recent overview of systems, see [6]. The above approaches 
use hand-coded wrappers. [9] describe a grammar-based tool for coding wrappers 
producing OEM output. 

Jedi [11] is a tool for manually specifying wrappers for HTML pages by a 
framework combining grammars and rules. W4F [17] is a toolkit for interactively 
generating wrappers for HTML pages using HEL, a DOM-based language op- 
erating on the parse-tree of a document. HEL focusses on wrapping single Web 
pages; Web exploration and information integration is left to the application. 
Comma-lists and name-value pairs are regarded as atomic - i.e., the result is
not completely mapped to the application domain, requiring a manually speci- 
fied application-specific postprocessing. Another tool for interactive generation 
of wrappers using the DOM model is presented in NoDoSE [1] . 

[2] present a grammar-based approach for semiautomatically wrapping HTML 
pages. Their approach does not support HTML tables or any textual formatting 
capabilities such as commalists or comma-pair-lists - which is a common way 
for structuring data on Web pages. Thus, the extraction of the data as shown in 
Section 4 is not possible with their wrappers. 

[13] present inductive learning methods for wrapping HTML pages with a
tabular layout and report 48% success. In [5], higher than 90% success is
reported for automatic matching-based wrapper generation for data-rich and
ontologically narrow sources.

Currently, (semi-)automatic approaches for wrapper generation do not (yet)
provide a sufficiently fine granularity for exhaustively extracting information
from semistructured sources - especially if these use textual formatting. Indeed,
given the many idiosyncrasies of real-world sources, it seems more viable to
employ high-level declarative languages for rapid prototyping and refinement of
wrappers. Here, automatically pre-generating wrapper skeletons which are then
refined manually could be a useful combination.

To our knowledge, for the above-mentioned tools, no complete case study
comprising wrapping and integration of several sites has been published. We
claim that the case study [15] would not be possible with the above-mentioned
(semi-)automatic approaches. Wrapping would be possible with the wrapping
tools Jedi [11], NoDoSE [1], and W4F [17], but these lack suitable integration
functionality - which then has to be done in a different framework.

Abstract. When publishing documents on the Web, the user needs to
describe and classify her documents for the benefit of later retrieval and 
use. This paper presents an approach to semantic document classifica- 
tion and retrieval based on Natural Language Processing and Conceptual 
Modeling. The Referent Model language is used in combination with a 
lexical analysis tool to define a controlled vocabulary for classifying doc- 
uments. Documents are classified by means of sentences that contain 
the high frequency words in the document that also occur in the do- 
main model defining the vocabulary. The sentences are parsed using a 
DCG-like grammar, mapped into a Referent Model fragment and stored 
along with the document using RDF-XML syntax. The model fragment 
represents the connection between the document and the domain model 
and serves as a document index. The approach is being implemented for 
a document collection published by the Norwegian Center for Medical 
Informatics (KITH). 



1 Introduction 

Project groups, communities and organizations today turn to the Web to dis- 
tribute and exchange information. While the Web makes it fairly easy to make 
information available, it is more difficult to find an efficient way to organize, de- 
scribe, classify, and present the information for the benefit of later retrieval and 
use. One of the most challenging tasks in this respect is the semantic classifica- 
tion of documents. This is often done using a mixture of text analysis methods, 
carefully defined (or controlled) vocabularies or ontologies, and certain schemes 
for how to apply these vocabularies when describing documents. 

Conceptual modeling languages have the formal basis that is necessary for
defining a proper ontology and they also offer visual representations that enable 
users to take active part in the modeling. Used as a vehicle for defining ontologies, 
they let the users graphically work on the vocabulary and yield domain models 
of indexing terms and their relationships, which may later be used interactively 
to classify, browse and search for documents. 

In our approach, conceptual modeling is combined with Natural Language
Processing (NLP). Rather than indexing the documents with sets of unrelated
keywords, we construct indices that represent small fragments of the domain
model. These indices tell us not only which concepts were included in the text,
but also how these concepts are linked together in more meaningful units. NLP
allows users to express clear-text sentences that contain the selected concepts.
These sentences are parsed using a DCG-like grammar and mapped into a model
fragment. Searching for a document, the user enters a natural language phrase
that is matched against the document indices.

Approaches to metadata descriptions of documents on the Web today should 
conform to the emerging standard for describing networked resources, the Re- 
source Description Framework (RDF). We show how our indices of model frag- 
ments may be mapped into an RDF representation and stored using the RDF- 
XML serialization syntax. In our approach, the domain model serves as the 
vocabulary for defining RDF statements, while NLP represents an interface to 
the creation of such metadata statements. 

The next section of this paper presents the theoretical background of our 
approach and refers to related work. Section 3 describes the situation at KITH¹
today, the case study that has motivated the approach, and Section 4 discusses 
our proposed solution. The approach is elaborated through a working example 
from KITH. Section 5 concludes the paper and points to some further work. 

2 Describing Documents on the Web 

To some extent, organization and description of information should be done by 
the users themselves. In cooperative settings, users have to take active part in 
the creation and maintenance of a common information space supporting their 
work [27, 3]. The need for — and the use of — information descriptions are often 
situation dependent. Within a group or community, the users themselves know 
the meaning of the information to be presented and the context in which it is 
to be used. What is often needed are tools that help the users to participate in
the organization and description of documents. A definition of how to present
information on the Web is given in a 3-step reference model by [18]:

— Modularization: Find a suitable storage format for the information to be 
presented. 

— Description: Describe the information for the benefit of later retrieval and 
use. Document-descriptive metadata may be divided into two categories: 
Contextual and Semantic descriptive metadata: 

• Contextual metadata: Any contextual property of the document, such as its
author, title, modified date, location, owner, etc. These descriptions should
also strive to link the document to any related aspect such as projects,
people, groups, tasks, etc.

• Semantic metadata: Information describing the intellectual content/meaning
of the document, such as selected keywords, written abstracts
and text indices. In cooperative settings or in organizational memory
approaches, free-text descriptions, annotations or collected
communication/discussion may also be used.

— Reading-functions: On-line presentation of information enables the “publisher”
to provide advanced reading-functions such as searching, browsing,
automatic routing of documents, notification or awareness of document
changes or the ability to comment or annotate documents.

¹ Norwegian Center for Medical Informatics, www.kith.no.

In this paper our focus is the semantic classification of documents using key- 
words from a controlled vocabulary. What we need then, is a way of defining the 
vocabulary and a way of applying it when classifying the individual documents. 
In collaborative settings, the users should be able to define and use their own 
domain specific ontology as the controlled vocabulary. Furthermore, they need 
an interface to classification that allows them to work with both the text and 
the domain model, and to create meaningful indices that connect the text to
the model. 

When the documents are to be made available to the public, the interface 
of the retrieval system cannot assume any particular knowledge of the domain. 
As the users will not necessarily know the particularities of the interface either, 
the search expressions must be obvious and reflect a natural way of specifying 
an information need. Adopting natural language interfaces is one way of dealing 
with this, though the flexibility of natural languages is a huge challenge and can 
easily hamper the efficiency of the search. 

2.1 Related Work 

Document type or document structure models and definitions have for some time 
been used to recognize and extract information from texts [21]. Data models of
contextual metadata are used in order to design, set up and configure presen- 
tation and exchange on the Web [17,2,25]. Together with “dataweb” solutions, 
data models are also used to define the export schema of data that populate 
Web pages created or materialized from underlying databases. 

Shared or common information space systems in the area of CSCW, like
BSCW [4], ICE [7] or FirstClass [10] mostly use (small) static contextual metadata
schemes, connecting documents to e.g. people, tasks, or projects, and rely
on freely selected keywords or free-text descriptions for the semantic description.
TeamWave Workplace [34] uses a “concept map” tool, where users may
collaboratively define and outline concepts and ideas as a way of structuring
discussion. There is however no explicit way of utilizing this concept graph in
the classification of information.

Ontologies [35,13,15] are nowadays being collaboratively created [12,5] across
the Web, and applied to search and classification of documents. Ontobroker [8,9]
and Ontosaurus [33] allow users to search and also annotate HTML documents
with “ontological information”. Hand-crafted Web directories like Yahoo, or the
collaborative Mozilla “OpenDirectory” initiative, offer improved searches by
utilizing the directory hierarchy as a context for search refinement [1]. Domain-specific
ontologies or thesauri are used to refine and improve search expressions
as in [22]. Domain-specific ontologies or schemas are also used together with
information extractors or wrappers in order to harvest records of data from data-rich
documents such as advertisements, movie reviews, financial summaries, etc. [6,11,14].

[23] uses a hierarchy of generic concepts, together with a WordNet thesaurus
and metadata in order to organize information and to search in networked multi- 
databases. Collaboratively created concept-structures and concept-definitions 
are used in Knowledge Management [36,37] and Organizational Memory ap- 
proaches [29] . Many of these approaches use textual descriptions (plus annota- 
tions and communications) attached to documents as the semantic classification. 

Whereas early information retrieval systems made use of keywords only, 
newer systems have experimented with natural language interfaces and linguistic 
methods for constructing document indices and assessing the user’s information 
needs. As natural language interfaces use phrases instead of sets of search terms, 
they generally increase precision at the expense of recall [32,28]. Some natural 
language interfaces only take into consideration the proximity of the words in 
the documents; that is, a document is returned if all the words in the search 
phrase appear in a certain proximity in the document. In other systems the 
search phrases are matched with structured indices reflecting some linguistic or 
logical theory, e.g. [20,26]. Finding an appropriate representation of these in- 
dices or semantic classifications is central in many IR projects and crucial for 
the NLP-based IR systems. 

Naturally, in order to facilitate information exchange and discovery, several
“Web standards” are also emerging. The Dublin Core [38] initiative gives a
recommendation of 15 attributes for describing networked resources. W3C’s
Resource Description Framework [39] applies a semantic-network-inspired language
that can be used to issue metadata statements about published documents. RDF
statements may be serialized in XML and stored with the document itself. Pure
RDF does not, however, include a facility for specifying the vocabulary used in
metadata statements. For this, users may rely on XML namespace techniques in
order to define their own sets of names that can be used in the RDF-XML tags.

3 KITH’s Document Collection 

A Norwegian center of medical informatics — KITH — has the editorial respon- 
sibility for creating and publishing ontologies covering various medical domains 
like physiotherapy, somatic hospitals and psychological health care. These on- 
tologies take the form of a list of selected terms from the domain, their textual 
definition (possibly with examples) and cross-references among them. Once cre- 
ated, the ontologies are distributed in print to various actors in the domain and 
are used to improve communication and information exchange in general, and 
in information systems development projects. 

The ontologies are created on the basis of a selected set of documents from the
domain. The process is outlined in Figure 1. The set of documents is run through
a lexical analysis, using the WordSmith toolkit [30]. The lexical analysis filters
out the non-interesting stop words and also matches the documents against a
referent set of documents assumed to contain “average” Norwegian language.

Fig. 1. KITH example: constructing the ontology / modeling process



The result of the lexical analysis is a list of approximately 700 words that occur 
frequently in (documents from) this domain. 

Words are then carefully selected and defined through a manual and co- 
operative process that involves both computer scientists and doctors. In some 
cases, this process is distributed and asynchronous, where participants exchange 
suggestions and communicate across the Web. Conceptual modeling is used to 
create an overview — and a structuring — of the most central terms and their 
inter-relations. The final result of this process is an MS Word document that
includes some overview models and approximately 100-200 well-defined terms.

This is where the process stops today. KITH’s goal is to extend the process 
and publish the documents on the Web, organized and classified according to 
the developed domain ontology. 



4 Referent-Model Based Classification of Documents 

An overview of our proposed approach for publishing the medical documents on
the Web is depicted in Figure 2. The documents are classified according to the
following strategy:

1. Matching: The document text is analyzed and matched against the conceptual
model. The system provides the user with a list of the high-frequency
words from the document that are also represented in the model. The user 
may interact with the model and select those concepts that are most suit- 
able to represent the document content. The user then formulates a number 
of sentences that contain these concepts and reflect her impression of the 
document content. 

2. Classification and Translation: The system parses the sentences using a
Definite Clause Grammar, producing logical formulas for each sentence [24].
These formulas come in the form of predicate structures, e.g. contain(journal,
diagnosis) for the sentence “a journal contains a diagnosis”, and refer to concepts
defined in the domain model. From the formulas, the system proposes
possible conceptual model fragments that serve as indices/classifications of
the document and are represented in RDF. In case there are several possible
interpretations of the sentence in terms of conceptual model fragments, the
user is asked to choose the interpretation closest to her information needs.

3. Presentation: The model, the definitions of terms in the model, and
the documents themselves should be presented on the Web. The model represents
an overview of the document domain. Users may browse and explore
the model and the definitions in order to get to know the subject domain.
Direct interaction with the model should be enabled, allowing users to click
and search in the stored documents. Exploiting the capabilities of NLP, the
system should also offer free-text search. In both search approaches, the
model is used to refine the search, letting the user replace search terms with
specialized or generalized terms or add other related terms to the search
expression.

Fig. 2. KITH example: model-based classification of documents - proposed approach

Fig. 3. Referent Model Language - syntax by example



4.1 Referent Model Language 

The semantic modeling language used, the Referent Model Language [31], has 
a simple and compact graphical notation that makes it fairly easy for end-users 
to understand. The language also supports the full set of CAGA-abstraction 
mechanisms [19], which is necessary to model semantic relations between terms 
like synonyms, homonyms, common, and part terms. The syntax of the language 
is shown in Figure 3. The basic constructs are referent sets and individuals, their 
properties and relations. Sets, individual constructs and relational constructs 
all have their straightforward counterpart in basic set theory. Properties are 
connected to referent sets, where each property represents a relation from each 
member of the set into a value set. 

Relations are represented by simple lines, using arrows to point from many to 
one and filled circles to indicate full coverage. Several relations may be defined 
between any two sets. Sometimes relations consist of the same member-tuples, 
sometimes one relation is defined as a composition of other relations. Such re- 
lations are often referred to as derived relations. Derived relations are specified 
using the composition operator o. 

All general abstraction mechanisms (classification, aggregation, generalization
and association, CAGA) are supported by the language. The example to the
right in Figure 3 shows a world of cats. The set of all Cats may be divided
into two disjoint sets, male (Catts) and female (Cattes). The filled circle on the
disjoint symbol indicates that this is a total partition of the set of cats. That
is, every Cat is perceived to be either female or male. On the other hand, there
are several other ways of dividing the set of cats, e.g. House cats, Persian cats,
Angora cats etc. These sets may be overlapping, i.e. a cat may be both a house
cat and an angora cat. The absence of a filled circle indicates that there may
be other kinds of cats as well that are not included in the model (e.g. wild cats,
lost cats).

We have also defined a cat as an aggregation of cat parts (heads, bodies, 
legs and tails). The use of ordinary relations from the aggregation symbol to 
the different part sets shows how a cat is constructed from these parts. A cat 
may have up to 4 legs but each leg belongs to only one cat. Heads and bodies 
participate in a 1:1 correspondence with a cat, that is a cat has only one head 
and one body. A cat must (filled circle) have a head and a body, while legs and 
tails are considered optional (no filled circle). 



4.2 Example from the KITH Collection 

Figure 4 shows a test interface for classification. The referent model shown at 
the top of Figure 4a is a translated fragment² of the model for the domain
somatic hospitals. The model shows how a patient consults some kind of health 
institution, gets her diagnosis, then possibly receives treatment and medication. 
All relevant information regarding patients and their treatment is recorded in a 
medical patient-journal. 

We have selected a test document,³ matched it against terms found in the
referent model and selected sentences (c) containing words found in the model. 
The applet in Figure 4b shows how the user can have the domain model (a) 
available, work with the sentences, select words and fragments of these, and 
have them translated into RDF-XML statements (d). 

The translation of NL expressions to RDF starts with a DCG-based pars- 
ing of the sentences. The production rules of DCG are extended with logical 
predicates that combine as the syntax tree of the sentence is built up. While 
parsing a sentence like “Journals should be kept for each patient,” for example, 
a logical predicate for the whole sentence is constructed. Some modifications of 
traditional Definite Clause Grammars are made, so that modal verbs are not in- 
cluded in the logical predicates, and prepositions are added to the verbs rather 
than to the nominals. For the analysis of the sentence above, thus, the predi- 
cate kept_for(journal, patient) is returned. After generating a set of predicates 
representing the sentences chosen by the user, the system maps the sentences to 
RDF expressions. Predicates show up as RDF relations, and their arguments are 
mapped onto RDF nodes. The logical mapping system is based on the abductive 
system introduced in [16]. 
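
The last step, from a logical predicate to an RDF statement, can be pictured as follows. This Python sketch is not the abductive mapping system of [16]; it only shows, under simplifying assumptions, how a binary predicate such as kept_for(journal, patient) could be turned into a subject-property-object triple. The base URI and the function name predicate_to_triple are hypothetical.

```python
import re

# A binary predicate in the textual form name(arg1, arg2).
PREDICATE = re.compile(r"(\w+)\((\w+),\s*(\w+)\)")


def predicate_to_triple(predicate, base="http://example.org/model#"):
    """Map 'rel(subj, obj)' to an RDF-style (subject, property, object) triple.

    The base URI is a placeholder; in the approach described above, the
    names would refer to referents and relations of the domain model.
    """
    match = PREDICATE.fullmatch(predicate.strip())
    if match is None:
        raise ValueError("not a binary predicate: %r" % predicate)
    relation, subject, obj = match.groups()
    return (base + subject, base + relation, base + obj)


if __name__ == "__main__":
    print(predicate_to_triple("kept_for(journal, patient)"))
```

The predicate name becomes the RDF property and its two arguments become the nodes, which is the correspondence stated in the paragraph above.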

This way, the users themselves create RDF-statements that serve as a se- 
mantic classification of the document. Their own domain model is used as the 
vocabulary for creating these statements, and the statements are created on the 
basis of a linguistic analysis of sentences describing the document. 



² The term definition catalogue for this domain contains 112 Norwegian terms.

³ “Regulation for doctors and health institutions’ keeping of patient journals” (in
Norwegian), http://www.helsetilsynet.no/regelver/forskrif/f89-277.htm

Fig. 4. Selection of sentences and translation to RDF



4.3 The RML — RDF Connection 

The RDF statements relate the content of the document to the terminology 
defined in the domain model. Words from the selected sentences that have a
corresponding referent-set defined in the model are represented as RDF-nodes.
Also, words that can be said to correspond to members of a referent set may
be defined as nodes. RDF-properties are then defined between these RDF-nodes
by finding relations in the referent model that correspond with the text in the
selected sentences.

Fig. 5. RDF statements created on the basis of sentences shown in Figure 4

As the original models contain so few named relations, we may say that the 
creation of properties represents a “naming” of relations found in the model. 
In case of naming conflicts between relation names in the model and the text, 
we use the relation names found in the model. In some cases, we may need 
to use names from derived relations. This may be done either by selecting the 
entire composition path as the relation name or by expanding this path into 
several RDF statements. The latter implies that we have to include RDF-nodes 
from referents that may not be originally mentioned in the sentences. Additional 
information, such as attributes, is given as RDF lexical nodes. A lexical node 
need not be mentioned in the referent model. 

Figure 5 shows the RDF statements created on the basis of the model and 
the selected sentences from Figure 4c. The created RDF nodes correspond to 
the gray referents in the shown model. In Figure 5, derived relations are written 
using the composition symbol. The nodes and the derived relations together form 
a path in the referent model. This path represents the semantic unit that will 
be used as the classification of the document. 

The RDF statements are translated into RDF-XML serialization syntax, as 
defined in [39] (shown in Figure 4d). Here, the referent model correspondences 
are given as attributes in the RDF-XML tags. In order to export document 
classifications using XML representations, we should be able to generate XML 
Document Type Definitions (DTDs) and namespace declarations directly from
the referent model. 

5 Conclusions 

We have presented an approach to semantic document classification and search 
on the Web, in which conceptual modeling and natural language processing 
are combined. The conceptual modeling language enables the users to define 
their own domain ontology. From the KITH approach today, we have a way 
of developing the conceptual model based on a textual analysis of documents 
from the domain. The conceptual modeling language offers a visual definition 
of terms and their relations that may be used interactively in an interface to 
both classification and search. The model is also formal enough for the ontology 
to be useful in machine-based information retrieval. We have shown through an 
example how the developed domain model may be used directly in an interface to 
both classification and search. In our approach, the model represents a vehicle for 
the users throughout the whole process of classifying and presenting documents 
on the Web. 

The natural language interface gives the users a natural way of expressing 
document classifications and search expressions. We connect the natural
language interface to that of the conceptual model, so that users are able to write
natural language classifications that reflect the semantics of a document in
terms of concepts found in the domain. Likewise, users searching for information
may browse the conceptual model and formulate domain-specific natural
language search expressions.

The document classifications are parsed by means of a DCG-based grammar 
and mapped into Resource Description Framework (RDF) statements. These de- 
scriptions are stored using the proposed RDF-XML serialization syntax, keeping 
the approach in line with the emerging Web standards for document descriptions. 

Today, we have a visual editor for the Referent Model Language that exports 
XML representations of the models. We have experimented with Java servlets 
that match documents against model fragments and produce HTML links from 
concepts in the document to their textual definitions found in the ontology. We 
also have a Norwegian electronic lexicon, based on the most extensive public 
dictionary in Norway, that will be used in the linguistic analysis of sentences. 
Some early work on the DCG parsing has been done, though it has not been
adapted to the particular needs of this application yet. We are currently working
on a Java-based implementation of the whole system. Our goal is to provide a 
user interface that may run within a Web browser and support the classification 
and presentation of documents across the Web. 

Abstract. The World-Wide Web is an enormous, distributed, and heterogeneous
information space. Currently, with the growth of available
data, finding interesting information is difficult. Search engines like Altavista
are useful, but their results are not always satisfactory. In this
paper, we present a method called Knowledge Discovery on the Web for
extracting connections between terms. The knowledge in these connections
is used for query expansion. We present experiments performed
with our system, which is based on the SMART retrieval system. We
used the comparative precision method for evaluating our system against
three well-known Web search engines on a collection of 60,000 Web pages.
These pages are a snapshot of the IMAG domain and were captured using
the CLIPS-Index spider. We show how the knowledge discovered can
be useful for search engines.



1 Introduction 

The task of an Information Retrieval (IR) system is to process a collection of 
electronic documents (a corpus), in such a way that users can retrieve documents 
from the corpus whose content is relevant to their information need. In contrast
to Database Management Systems (DBMS), an IR user expresses queries indi- 
cating the semantic content of documents s/he seeks. The IR system processes 
some knowledge contained inside documents. 

We highlight two principal tasks in an IR system: 

— Indexing: the extraction and storage of the semantic content of the docu- 
ments. This phase requires a model of representation of these contents, called 
a document model. 

— Querying: the representation of the user’s information need (generally in 
query form), the retrieval task, and the presentation of the results. This 
phase requires a model of representation of the user’s need, called a query 
model, and a matching function that evaluates the relevance of documents 
to the query. 

IR systems have typically been used for document databases, like corpora 
of medical data, bibliographical data, technical documents, and so forth. Such 
documents are often expressed in multimedia format. The recent explosion of 
the Internet, and particularly the World Wide Web, has created a relatively new
field where Information Retrieval (IR) is applicable. 

Indeed, the volume of information available and the number of users on the
Web have grown considerably. In July 1998, the number of users on the Web was
estimated at 119 million (NUA Ltd Internet survey, July 1998). The number
of accessible pages was estimated in December 1997 at 320 million [21], and
in February 1999 at 800 million [22]. Moreover, the Web is an immense and
sometimes chaotic information space without a central authority to supervise
and control what occurs there. In this context, and in spite of standardization
efforts, this collection of documents is very heterogeneous, both in terms of
contents and presentation (according to one survey [4], the HTML standard is
observed in less than 7% of Web pages). We can expect to find almost
anything there: in this context, retrieving precise information seems to be like
finding the proverbial “needle in the haystack.”

The current grand challenge for IR research is to help people to profit from
the huge amount of resources existing on the Web. But there does not yet exist an
approach that satisfies this information need both effectively and efficiently. In
IR terms, effectiveness is measured by precision and recall, while efficiency is a
measure of system resource utilization (e.g., memory usage or network load). To help
users find information, search engines (e.g., Altavista, Excite, HotBot, and Lycos [27])
are available on the Web. They make it possible to find pages according to various
criteria that relate mainly to the content of documents. These engines process
significant volumes of documents, with several tens of millions of indexed pages.
They are nevertheless very fast and are able to handle several thousand queries
per second. In spite of all these efforts, the answers provided by these systems
are generally not very satisfactory: they are often too numerous, disorganized, and
not very precise, with a lot of noise. Preliminary results obtained with a test collection
of the TREC Conference Web Track showed the poor result quality of five well-known
search engines, compared with those of six systems taking part in TREC
[17].

Berghel considers the current search engines as primitive versions of the
future means of information access on the Web [5], mainly because of their incapacity
to distinguish the “good” from the “bad” in this huge amount of information.
According to him, this evolution cannot be achieved by simple improvements to
current methods, but requires on the contrary the development of new approaches
(intelligent agents, tools for information personalization, push tools, etc.)
directed towards filtering and a finer analysis of information. A new trend
in research in information gathering from the Web is known as knowledge discovery,
which is defined as the process of extracting knowledge from discovered
resources or the global network as a whole. We will discuss this process and show
how to use it in an IR system.

This paper is organized as follows: after the presentation of related work in 
Section 2 and our approach in Section 3, we present in Section 4 how we can 
evaluate an IR system on the Web. We present our experimentation in Section 
5, while Section 6 gives a conclusion and some future directions. 

2 Related Work

Data Mining (DM) and Knowledge Discovery in Databases (KDD) [11, 12] are at 
the intersection of several fields: statistics, databases, and artificial intelligence. 
These fields are concerned with structured data, although most of the data 
available is not structured, like textual collections or Web pages. Nowadays, 
research using data mining techniques is conducted in many areas: text mining
[1,10,15], image mining [25], multimedia data mining [30], Web mining, etc.
Cooley et al. [9] define Web Mining as: 

Definition 1 Web Mining: the discovery and analysis of useful information
from the World-Wide Web including the automatic search of information resources
available on-line, i.e. Web content mining, and the discovery of user
access patterns from Web servers, i.e. Web usage mining.

So Web mining is the application of data mining techniques to Web resources. 
Web content mining is the process of discovering knowledge from the content
of documents or their descriptions. Web usage mining is the process of extracting
interesting patterns from Web access logs. We can add Web structure mining to
these categories. Web structure mining is the process of inferring knowledge from 
the Web organization and links between references and referents in the Web. We 
have focused our research on Web content mining. Some research in this area 
tries to organize Web information into structured collections of resources and 
then uses standard database query mechanisms for information retrieval. 

According to Hsu [19], semi-structured documents using HTML tags are 
sufficiently structured to allow semantic extraction of usable information. He 
defines some templates, and then determines which one matches a document 
best. Then the document is structured following the chosen template. Atzeni, 
in the ARANEUS project [2], aims to manage Web documents, including hy- 
pertext structure, using the ARANEUS document model (ADM) based on a 
DBMS. Hammer stores HTML pages using the document model OEM (Object 
Exchange Model) [18] in a very structured manner. These methods depend on 
the specification of document models. For a given corpus, it is thus necessary 
either to have an existing a priori knowledge of “patterns”, or to specify and 
update a set of classes of pages covering all the corpus. This very structured 
approach does not seem to be adapted to the Web, which is not well structured. 
Nestorov considers that in the context of the Web, we generally do not have any
pre-established patterns [23]. According to Nestorov, corpus size and page
diversity make it very difficult to use these methods. Indeed, even if some Web pages
are highly structured, this structure is too irregular to be modeled efficiently 
with structured models like the relational model or the object model. 

Other research has focused on software agents dealing with heterogeneous 
sources of information. Many software agents have used clustering techniques in 
order to filter, retrieve, and categorize documents available on the Web. Theil- 
mann et al. [28] propose an approach called domain experts. A domain is any 
area of knowledge in which the contained information is semantically related. A 
domain expert uses mobile filter agents that go to specific sites and examine
remote documents for their relevance to the expert’s domain. The initial knowledge
is acquired from URLs collected from existing search engines. Theilmann et al.
have used only knowledge that describes documents (keywords, metakeywords,
URLs, date of last modification, author), and this knowledge is represented in
facets.

Clustering techniques are often used in the IR domain, but measures of similarity
based on distance or probability functions are too ad hoc to give satisfactory
results in a Web context. Others use a priori knowledge. For example, the
reuse of an existing generic thesaurus such as Wordnet [24] has produced few
successful experiments [29] because there is no agreement about the potential
that it can offer, the adequacy of its structure for information retrieval, and the
automatic disambiguation of documents and queries (especially short queries).

A data mining method that does not require pre-specified ad-hoc distance 
functions and is able to automatically discover associations is the association 
rules method. A few researchers have used this method to mine the Web. For 
example, Moore et al. [16] have used the association rules method to build a
clustering schema. In their experiments, this technique gives better results than
classical clustering techniques such as hierarchical agglomerative clustering and
Bayesian classification methods. But these authors have experimented with only
98 Web pages, which seems to be too few (in the Web context) to be significant.

We now discuss how to use association rules to discover knowledge about con- 
nections between terms, without changing the document structure and without 
any pre-established thesaurus, hierarchical classification, or background knowl- 
edge. We will show how it can be advantageous for an IR system. 

3 Our Approach in the Use of Web Mining 

We use association rules to discover knowledge on the Web by extracting
relationships between terms. Using these relationships, people who are unfamiliar
with a Web domain or who don’t have a precise idea about what they are seek- 
ing will obtain the context of terms, which gives an indication of the possible 
definition of the term. Our approach is built on the following assumptions: 

Hypothesis 1 (Terms and knowledge): Terms are indicators of knowledge.
For a collection of a domain, knowledge can be extracted about the context leading 
to the use of terms. By use, we mean not only term frequencies, but also the ways 
a term is used in a sentence. In this paper we only take into account the use of 
terms together in the same document. 

By “terms,” we mean words that contribute to the meaning of a document, so
we exclude common words. We emphasize the participation of words together to 
build up the global semantics of a document. So we consider two words appearing 
in the same document to be semantically linked. We make these links explicit 
using association rules described in the next section. 







M. Gery, M.H. Haddad 



3.1 Association Rules 

One of the data mining functions in databases is the comprehensive process
of discovering links between data using the association rules method. Formally
[3], let $I = \{t_1, t_2, \ldots, t_m\}$ be a set of items, and let
$T = \{X_1, X_2, \ldots, X_n\}$ be a set of transactions, where each transaction
$X_i$ is a subset of items: $X_i \subseteq I$. We say that a transaction $X_i$
contains another transaction $X_j$ if $X_j \subseteq X_i$. An association
rule is a relation between two transactions, denoted as $X \Rightarrow Y$, where at least
$X \cap Y = \emptyset$. We define several properties to characterize association rules:

— the rule $X \Rightarrow Y$ has support $s$ in the transaction set $T$ if $s\%$ of transactions
in $T$ contain $X \cup Y$.

— the rule $X \Rightarrow Y$ holds in the transaction set $T$ with confidence $c$ if $c\%$ of
transactions in $T$ that contain $X$ also contain $Y$.

The problem of mining association rules is to generate all association rules 
that have support and confidence greater than some specified minimums. 

We adopt the approach of association rules in the following way: we handle
documents as transactions and terms as items. For example, suppose we are
interested in relations between two terms. Let $t_j$ and $t_k$ be two terms, let $D$ be a
document set, let $X = \{t_j\}$ and $Y = \{t_k\}$ be two sets of terms containing
only one term each, and let $X \Rightarrow Y$ be an association rule. The intuitive meaning
of such a rule is that documents which contain the term $t_j$ also tend to contain
the term $t_k$. Thus, some connection can be deduced between $t_j$ and $t_k$. The
number of association rules tends to be large, so we should fix thresholds,
depending on the collection and on the future use of the connections. Thus,
measures of support and confidence are interpreted in the following way:

— the rule $X \Rightarrow Y$ has support $s$ in document set $D$ if $s\%$ of the documents in
$D$ contain $X$ and $Y$.

$$\mathrm{support}(X \Rightarrow Y) = \mathrm{Prob}(X \text{ and } Y) \qquad (1)$$

This is the percentage of documents containing both $X$ and $Y$. We fixed a
minimum support threshold and a maximum support threshold to eliminate
very rare and very frequent rules that are not useful.

— the rule $X \Rightarrow Y$ holds in document set $D$ with confidence $c$ if $c\%$ of the
documents in $D$ that contain $X$ also contain $Y$.

$$\mathrm{confidence}(X \Rightarrow Y) = \mathrm{Prob}(Y \mid X) = \frac{\mathrm{Prob}(X \text{ and } Y)}{\mathrm{Prob}(X)} \qquad (2)$$

This is the percentage of documents containing $Y$ among those that contain $X$.
We have fixed a minimum confidence threshold.
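
The two measures, and the thresholding on them, can be sketched as follows for rules between single terms. This is a minimal illustration in which documents are treated as sets of terms; the function names, the toy corpus, and the threshold values are assumptions made for the example, and the exhaustive enumeration of term pairs is a simplification rather than the mining algorithm of [3].

```python
from itertools import permutations


def support(docs, x, y):
    """Fraction of documents that contain every term of x and of y."""
    both = x | y
    return sum(1 for d in docs if both <= d) / len(docs)


def confidence(docs, x, y):
    """Fraction of the documents containing x that also contain y."""
    with_x = [d for d in docs if x <= d]
    if not with_x:
        return 0.0
    both = x | y
    return sum(1 for d in with_x if both <= d) / len(with_x)


def mine_pair_rules(docs, min_sup, max_sup, min_conf):
    """Keep rules {tj} => {tk} whose support and confidence pass the thresholds."""
    terms = set().union(*docs)
    rules = []
    for tj, tk in permutations(sorted(terms), 2):
        s = support(docs, {tj}, {tk})
        c = confidence(docs, {tj}, {tk})
        if min_sup <= s <= max_sup and c >= min_conf:
            rules.append((tj, tk, s, c))
    return rules


# Toy corpus, invented for illustration: each document is a set of terms.
docs = [
    {"retrieval", "index", "query"},
    {"retrieval", "query"},
    {"retrieval", "web"},
    {"web", "spider"},
]
print(support(docs, {"retrieval"}, {"query"}))  # 2 of 4 documents: 0.5
for rule in mine_pair_rules(docs, min_sup=0.5, max_sup=0.9, min_conf=0.6):
    print(rule)
```

Note that the rule is directional: the support of a pair is symmetric, but its confidence generally is not.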

At this point of the process, a knowledge base of strong connections between 
terms or sets of terms is defined. Our assumption is that: 

Hypothesis 2: For a term $t_j$, its context in a collection can be induced from
the set of terms $C$ connected with $t_j$. $C$ allows us to discover the use tendency
of $t_j$.

3.2 Association Rules in IR Process 

The association rules method has been used to create a knowledge base of strong
connections between terms. Several scenarios to exploit this knowledge in an IR 
system can be proposed: 

— Automatic expansion of the query: queries are automatically expanded using 
the connections between terms, without pre-established knowledge. 

— Interactive interface: an interactive interface helps users formulate their 
queries. The user can be heavily guided during the process of query for- 
mulation using connections between terms. This method is the one most 
frequently used by the systems cited in Section 2, where the user specifies 
the thresholds for support, confidence, and interest measures. 

— Graphical interface: a graphical interface, like Refine of AltaVista [6], allows 
the user to refine and to redirect a query using terms related to the query 
terms. 

We have focused our experimentation on the first proposal. Its main charac- 
teristic is that it does not need established knowledge, which runs counter to the 
IR tendency where an established domain-specific knowledge base [7] or a general 
world knowledge base like Wordnet [29] is used to expand queries automatically. 
Our approach is based only on knowledge discovered from the collection. 
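
Automatic expansion can be pictured with a short sketch. It assumes the knowledge base is available as a mapping from each term to its connected terms, ordered from strongest to weakest; the mapping, the function name expand_query, and the max_added limit are illustrative assumptions, not the interface of our system.

```python
def expand_query(query_terms, connections, max_added=4):
    """Append terms connected to each query term, keeping the original order."""
    expanded = list(query_terms)
    seen = set(query_terms)
    for term in query_terms:
        # Take at most max_added connected terms per query term.
        for related in connections.get(term, [])[:max_added]:
            if related not in seen:
                expanded.append(related)
                seen.add(related)
    return expanded


# Hypothetical knowledge base: term -> connected terms, strongest first.
connections = {
    "retrieval": ["query", "index"],
    "web": ["spider", "retrieval"],
}
print(expand_query(["web", "retrieval"], connections))
```

The original query terms stay at the front of the expanded query, so a weighting scheme could still favour them over the added terms.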

4 Evaluation of an IR System on the Web 

Classically, two criteria are used to evaluate the performance of an IR system: 
recall and precision [26]. Recall represents the capacity of a system to find rel- 
evant documents, while precision represents its capacity not to find irrelevant 
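documents. To calculate these two measures, we use these two sets: $\mathcal{R}$ (the set
of documents retrieved by the system), and $\mathcal{P}$ (the set of relevant documents in
the corpus):

$$\mathrm{Recall} = \frac{\|\mathcal{R} \cap \mathcal{P}\|}{\|\mathcal{P}\|} \qquad (3)$$

$$\mathrm{Precision} = \frac{\|\mathcal{R} \cap \mathcal{P}\|}{\|\mathcal{R}\|} \qquad (4)$$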
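
As a small worked example of Equations (3) and (4), the sketch below computes both measures from a set of retrieved documents and a set of relevant documents. The document identifiers are invented, and returning 0.0 for an empty set is a convention chosen for this sketch.

```python
def recall(retrieved, relevant):
    """Share of the relevant documents that the system retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


retrieved = {"d1", "d2", "d3", "d4"}  # documents returned by the system
relevant = {"d2", "d4", "d5"}         # documents judged relevant
print(recall(retrieved, relevant))     # 2 of 3 relevant documents found
print(precision(retrieved, relevant))  # 2 of 4 retrieved documents relevant: 0.5
```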



To make a statistical evaluation of system quality, it is necessary to have a 
test collection: typically, a corpus of several thousand documents and a few dozen 
queries, which are associated with relevance judgements (established by experts 
having a great knowledge of the corpus). So for each query we can establish a 
recall/precision curve (precision as a function of recall). The average of these curves
makes it possible to establish a visual profile of system quality. 

In the context of the Web, such an evaluation has many problems: 

— The huge quantity of documents existing on the Web makes it quite difficult 
to determine which are relevant for a given query: how can we give an opinion 
on each document from a collection of 800 million HTML pages? 

— The heterogeneity of Web pages, in terms of content and presentation, makes 
it very difficult to establish a “universal” relevance judgement, valid for each 
page. 

— The Web is dynamic: we estimate [21], [22] that the quantity of documents on 
the Web doubles about every 11 months. Moreover, most pages are regularly 
modified, and more and more of them are dynamic (generated via CGI, ASP, 
etc.). 

— According to Gery [13], it is not satisfactory to evaluate a page only by 
considering its content: it is also important to consider the accessible in- 
formation: i.e. the information space for which this page is a good starting 
point. 

For example, a Web page that only contains links will contain very little 
relevant information, because of the few relevant terms in its content. In this 
case, the document relevance will be considered low. But if this list of links 
is a “hot-list” of good relevant references, then this page may be considered 
by a user to be very relevant. 

— A criticism related to the binary relevance judgement, which is generally 
used in traditional test collections. This problem is a well-known problem in 
the IR field and we agree with Greisdorf [14], saying that it is interesting to 
use a nonbinary relevance judgement. 
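
The doubling time quoted in the third item can be checked against the two page-count estimates given in the Introduction (320 million pages in December 1997 and 800 million in February 1999). The sketch assumes exponential growth between the two dates and takes the interval as 14 months; both are assumptions of this example.

```python
import math


def doubling_time(count_start, count_end, months_elapsed):
    """Months needed to double, assuming exponential growth between two counts."""
    growth = count_end / count_start
    return months_elapsed * math.log(2) / math.log(growth)


months = doubling_time(320e6, 800e6, 14)
print(round(months, 1))  # about 10.6 months
```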

In this context, establishing a test collection (i.e. a corpus of documents, a 
set of queries, and associated relevance judgements) seems to be a very difficult 
task. It is almost impossible to calculate recall, because it needs to detect all the 
relevant pages for a query. On the other hand, it is possible to calculate precision 
at n pages: when an IR system gives n Web pages in response to a query, a judge 
consults each one of these pages to put forward a relevance judgement. In this 
way, we know which are the $p$ most relevant pages ($\mathcal{R}_p$) for a query according
to a judge.

$$\text{precision at } n \text{ documents} = \frac{\|\mathcal{R}_p\|}{n} \qquad (5)$$
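
Precision at n documents needs only the ranked answer list and the judge's decisions on the first n pages, as the following sketch shows. The page identifiers are invented, and dividing by n even when fewer than n pages are returned is a convention assumed here.

```python
def precision_at_n(ranked_pages, judged_relevant, n):
    """Fraction of the first n returned pages that the judge marked relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_pages[:n]
    hits = sum(1 for page in top if page in judged_relevant)
    return hits / n


ranked_pages = ["a", "b", "c", "d", "e"]  # pages in the order returned
judged_relevant = {"a", "c", "e"}         # pages the judge found relevant
print(precision_at_n(ranked_pages, judged_relevant, 4))  # 2 of 4: 0.5
```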



There do exist such test collections for the Web, created as part of the TREC 
(Text Retrieval Conference) Web Track [17]. Two samples of respectively 1 giga- 
byte (GB) and 10 GB were extracted from a 100 GB corpus of Web documents. 
A set of queries similar to those used in a more traditional context was used 
with these collections, with several TREC systems and five well-known search 
engines. Judges performed relevance judgements for these results. So it is possible 
to evaluate these systems, using for example precision at 20 documents. 
One important criticism of this approach is its non-dynamic aspect: the
Web snapshot is frozen. Since the time when the pages were originally gathered, 
some pages may have been modified or moved. Moreover, adding the few months’ 
delay for setting up relevance judgements, we can say that this test collection 
is not very current. But the most important criticism we have to make deals 
with the nonbinary relevance judgement, and moreover the hypertextual aspect 
of the Web which is not considered by this approach. 

A similar approach, by Chakrabarti [8], uses a method called comparative 
precision to evaluate its Clever system against Altavista (www.altavista.com) 
and Yahoo! (www.yahoo.com). In this case, the evaluation objective is only the 
comparison between several systems, in the current state of their index and 
ranking function. But Chakrabarti has not created a test collection reusable 
for the evaluation of other systems. In [8], 26 queries were used; the 10 best 
pages returned by Clever, Altavista, and Yahoo! for each one of these queries 
were presented to judges for evaluation. This method has the advantage of being 
relatively similar to a real IR task on the Web. Indeed, the judges can consider 
all the parameters which they believe important: contents of the document, 
presentation, accessible information space, etc. But the queries used seem to be 
too short and not precise enough, even in this context where queries are generally 
short (1 or 2 terms) and imprecise. It may be difficult to judge documents with 
a query such as “cheese” or “blues”, and it seems to be difficult to evaluate 
an IR system with such a set of queries. But the principal disadvantage of this 
method of comparative precision is that it is almost impossible to differentiate 
the quality increase/decrease of a system’s results due to:

— The process of creation of the Web page corpus, i.e. if the search engine 
index has a good coverage of the whole Web, if there is a small number of 
redundant pages, etc. 

— The indexing, i.e. the manner of extracting the semantic contents from the 
documents. For example, it is difficult to evaluate if indexing the totality of 
terms is a good thing. Perhaps it is as effective to index only a subset of
terms carefully selected according to various parameters. 

— The page ranking using a matching function, which calculates the relevance 
of a page to a query. 

— The whole query processing. 

In spite of these disadvantages, this method permits us to evaluate the overall 
quality of a system. Thus, we have chosen it for evaluating our system in the next 
section, while keeping in mind these disadvantages and trying where possible to 
alleviate them. 



5 Our Experiments on the Web 

For the evaluation of our method’s contribution to the IR process, we have carried
out an experiment on a corpus of Web pages, consisting of several steps:

— Collect Web pages: it is necessary to have a robot (or spider) traverse the
Web in order to constitute a corpus of HTML pages.

— Indexing: we have used the SMART system [26], based on a Vector Space 
Model (VSM) for document indexing. 

— Query expansion: application of the association rules method to generate a 
knowledge base of connections between terms, used for expanding queries. 

— Querying: it is necessary to interface the various search engines in order 
to be able to submit queries to each of them and to collect the Web-page 
references proposed. 

— Evaluation: these references must be merged to present them to judges who 
will assign a relevance judgement to each one.



5.1 A Corpus of HTML Web Pages 

We have developed a robot called CLIPS-Index, with the aim of creating a significant
corpus of Web pages. This spider crawls the Web, storing pages in a precise
format, which describes information like content, outlinks, title, author, etc. An
objective is to collect the greatest amount of information we can in this very
heterogeneous context, which is not respectful of the existing standard. Indeed,
the HTML standard is respected in less than 7% of Web pages [4]. Thus, it is
very difficult to gather every page of the Web. In spite of this, our spider is quite 
efficient: for example we have collected more than 1.5 million pages on the .fr 
domain, compared to Altavista which indexes only 1.2 million pages on the same 
domain. 

Moreover, CLIPS-Index must crawl this huge hypertext without considering
non-textual pages, while respecting the robot exclusion protocol [20]. It is
careful not to overload remote Web servers, despite launching several hundred
HTTP queries simultaneously. This spider, running on an ordinary 300 MHz PC
that costs less than 1,000 dollars, is able to find, load, analyze, and store about
^ million pages per day. 
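
How a spider can honour the robot exclusion protocol is illustrated below with Python's standard urllib.robotparser module. This is a sketch of the check only, not the CLIPS-Index implementation; the robots.txt content, the helper name allowed, and the example.org URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content: every robot is kept out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Tell whether the given robot may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "CLIPS-Index", "http://www.example.org/public/a.html"))
print(allowed(ROBOTS_TXT, "CLIPS-Index", "http://www.example.org/private/a.html"))
```

A crawler would run such a check before each request, fetching the robots.txt file once per host.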

For this evaluation, we have not used the whole corpus of 1.5 million “.fr”
HTML pages, because of the various difficulties of processing 10 GB of textual
data. So we have chosen to restrict our experiment to Web pages of the IMAG
(Institut d’Informatique et de Mathématiques Appliquées de Grenoble) domain,
which are browsable starting from www.imag.fr. The main characteristics of this
collection are the following: 

— Size (HTML format): about 60,000 HTML pages which are identified by 
URL, for a volume of 415 MB. The considered hosts are not geographically 
far from each other. The gathering is thus relatively fast. 

— Size (textual format): after analysis and textual extraction from these pages, 
there remain about 190 MB of textual data. 

— Hosts: these pages are from 37 distinct hosts. 

— Page format: most of the pages are in HTML format. However, we have 
collected some pure textual pages. 




Automatic Query Expansion



— Topics: these pages deal with several topics, but most of them are scientific
documents, particularly in the computer science field. 

— Terms: more than 150,000 distinct terms exist in this corpus. 



5.2 Indexing 

For representing the documents’ content model, we have used one of the most
popular models in IR: the Vector Space Model [26], which allows term weighting
and gives very good results. A document’s content is represented by a vector in
an N-dimensional space:

$$\mathrm{Content}_i = (w_{i1}, w_{i2}, \ldots, w_{ij}, \ldots, w_{iN}) \qquad (6)$$

where $w_{ij} \in [0, 1]$ is the weight of term $t_j$ in document $D_i$. This weight is
calculated using a weighting function that considers several parameters, like the
term frequency $tf_{ij}$, the number of documents in which the term $t_j$ appears, $df_j$
(document frequency), and the number of documents $N_{doc}$ in the corpus:

$$w_{ij} = (\log_2(tf_{ij}) + 1) \cdot \log_2\left(\frac{N_{doc}}{df_j}\right) \qquad (7)$$
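
The weighting function can be tried on a small example. The sketch assumes that the second factor of Equation (7) is the usual inverse document frequency, log2 of Ndoc divided by dfj, as suggested by the parameters listed above; it returns the raw weight and does not show any normalization into [0, 1]. The function name and the numbers are chosen for illustration.

```python
import math


def term_weight(tf, df, n_docs):
    """Raw weight of a term: (log2(tf) + 1) * log2(n_docs / df)."""
    if tf <= 0 or df <= 0:
        return 0.0  # the term does not occur
    return (math.log2(tf) + 1) * math.log2(n_docs / df)


# A term occurring 4 times in a document and present in 10 of 40 documents.
print(term_weight(tf=4, df=10, n_docs=40))  # (2 + 1) * 2 = 6.0
```

A term that appears in every document gets weight 0, since log2(1) = 0.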

A filtering process preceding the indexing step makes it possible to eliminate 
a part of the omnipresent noise on the Web, with the aim of keeping only the 
significant terms to represent semantic document content. Several thresholds are 
used to eliminate extremely rare terms (generally corresponding to a typing error 
or a succession of characters without significance), very frequent terms (regarded 
as being meaningless), short or long terms, or for each document’s terms, those 
with a very low weight. The filtered corpus thus obtained has a reduced size 
(approximately 100 MB) and it retains only a fraction of the terms originally 
present (approximately 70,000 distinct terms). 



5.3 The Association Rules Process 

First we have to apply a small amount of pre-processing to the collection. In our
experiments, we have chosen to deal only with substantives, i.e. terms referring to
an object (material or not), because we believe that substantives carry most of
the semantics of a document. These terms should be extracted from documents by
morphological analysis. Then, we apply the association rules method
to obtain a knowledge base. We have used very restrictive thresholds for support
and confidence to discover the strongest connections between terms. For each
term, we discovered about 4 related terms. Every query is expanded with the terms
related to the query’s terms.
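
The process can be sketched as follows. This is a simplified illustration under stated assumptions, not the system described here: it only mines rules between pairs of terms, treats each document as a set of terms, and uses hypothetical function names (`mine_term_rules`, `expand_query`), toy documents, and arbitrary threshold values.

```python
from collections import Counter
from itertools import combinations


def mine_term_rules(documents, min_support, min_confidence):
    """Return {term: [related terms]} from pairwise rules a -> b.

    support(a, b)    = fraction of documents containing both a and b
    confidence(a->b) = documents containing a and b / documents containing a
    """
    n_docs = len(documents)
    term_sets = [set(doc) for doc in documents]
    term_count = Counter(t for terms in term_sets for t in terms)
    pair_count = Counter(
        pair for terms in term_sets for pair in combinations(sorted(terms), 2)
    )
    rules = {}
    for (a, b), count in pair_count.items():
        if count / n_docs < min_support:
            continue
        # Test the rule in both directions: a -> b and b -> a
        for head, tail in ((a, b), (b, a)):
            if count / term_count[head] >= min_confidence:
                rules.setdefault(head, []).append(tail)
    return rules


def expand_query(query_terms, rules):
    """Append to the query every term related to one of its terms."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in rules.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


# Toy corpus of substantives (hypothetical data)
docs = [
    ["retrieval", "index", "query"],
    ["retrieval", "index"],
    ["retrieval", "index", "web"],
    ["web", "page"],
]
rules = mine_term_rules(docs, min_support=0.5, min_confidence=0.8)
print(expand_query(["retrieval", "web"], rules))
```

With these thresholds only the pair (index, retrieval) is frequent enough, so the query on "retrieval" and "web" gains the single term "index".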

A query interface was developed to interrogate each system.

— Queries are specified either in a file or on-line. In the latter case, querying
resembles using a meta-search engine (Savvysearch, www.savvysearch.com, or
Metacrawler, www.metacrawler.com).

- Automatically sending an HTTP query specifying the information needed:
query terms, restriction fields, etc. The information expression is different
for each engine; our system is currently able to question a dozen of the most
popular search engines. Unfortunately, not all of these systems offer the
same functionality: for example, several of them do not make it possible to
restrict an interrogation to a Web sub-domain.

- Analyzing result pages provided by the system. The analysis is different for 
each engine. 

- Automatic production of an HTML page presenting the results. To avoid bias
toward a particular search engine, the references suggested are identified
only by their URLs and titles. The remainder of the information usually
proposed by search engines (summary, size, date of modification, etc.) is
not displayed. The order in which references appear is chosen randomly.
When several engines propose the same page, it appears only once in our
results.
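
The last step can be illustrated with a short Python sketch. It is only an illustration: the function name `merge_results`, the engine labels, and the example URLs are hypothetical, and sending the queries and parsing each engine's result pages are not shown.

```python
import random


def merge_results(results_by_engine, rng=None):
    """Merge per-engine result lists into one anonymous, shuffled list.

    results_by_engine maps an engine label to a list of (url, title) pairs.
    Only the URL and title are kept, each page appears once, and the order is
    random so that no engine can be recognized from the ranking.
    """
    if rng is None:
        rng = random.Random()
    merged = {}
    for results in results_by_engine.values():
        for url, title in results:
            # A page proposed by several engines is kept only once
            merged.setdefault(url, title)
    pages = [{"url": url, "title": title} for url, title in merged.items()]
    rng.shuffle(pages)
    return pages


# Hypothetical results from two engines with one page in common
pages = merge_results({
    "engine_a": [("http://a.example/1", "One"), ("http://a.example/2", "Two")],
    "engine_b": [("http://a.example/2", "Two"), ("http://a.example/3", "Three")],
})
print(len(pages))  # 3
```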



5.4 Evaluation 

For an initial evaluation of our system, we defined four queries that were rather 
general but that related to topics found in many pages of the corpus. For a more 
precise description of the user’s need, we presented in detail the query specifica- 
tions to the judges. Eight judges (professors and Ph.D. students) participated in 
the evaluation. Three systems were evaluated: Altavista, HotBot, and our system.
We asked the judges to give each page a relevance score: 0 for an irrelevant
page, 1 for a partially relevant page, 2 for a relevant page, and 3 for
a very relevant page. We defined the following formula to combine the scores
awarded by the judges:



\[ Score_s = \frac{1}{nb} \sum_{d} \frac{\sum_{j} jug_j}{nbj} \tag{8} \]

- Score_s is the score of the search engine s.

- d is a document.

- nb is the number of documents retrieved by search engine s.

- j is a judge.

- nbj is the number of judges participating in the experiments.

- jug_j is the score attributed by judge j to document d.
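
A small Python sketch may make the computation concrete. It assumes that the score of an engine is the mean, over the documents it retrieved, of the mean rating given by the judges, which is how the symbol definitions above read. The function name `engine_score` and the ratings are hypothetical. The values reported in the results table are on a larger scale than 0 to 3, so some additional factor not stated in the text may be involved; none is applied here.

```python
def engine_score(judgements):
    """Mean over documents of the mean judge rating.

    judgements[d][j] is the rating (0 to 3) given by judge j to document d,
    for the documents retrieved by one search engine.
    """
    nb = len(judgements)  # number of documents retrieved by the engine
    # Average the nbj judge ratings of each document
    per_document = [sum(ratings) / len(ratings) for ratings in judgements]
    return sum(per_document) / nb


# Three documents rated by two judges (hypothetical ratings)
print(engine_score([[3, 3], [2, 1], [0, 0]]))  # 1.5
```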

We obtained the following results (see Table 1):



Table 1. Experiment Results 



System       Score
Altavista    24.99
Hotbot       45.83
Our system   34.16

These results require much discussion. A more detailed analysis of the results
shows that in the case of imprecise queries, our system gives the best results.
In other cases, Hotbot has the best scores. For example, of the ten documents
retrieved by our system for the query “recherche d’informations multimedia”,
seven have a relevance score of 3, two score 2, and the last scores 0. For the
same query, the ten pages of Altavista score 0, while four pages of Hotbot score
2 and the others score 1. In this first experiment, we obtained good results
against Altavista using rough parameter settings, but the small number of
queries is not enough for the results to be significant. We will now try to
improve our method and study the different parameters. We will also exercise
our system with a significant number of queries and judges.



6 Conclusion 

Our approach is relatively easy to implement and is not expensive if the algo- 
rithm is well designed. It is able to use existing search engines to avoid reindexing 
the whole Web. If both data mining and IR techniques were effectively merged, 
much of the existing knowledge could be used in an IR system. 

We think our experiments are insufficient to validate this approach in the
context of an IR system. What we can conclude is that the use of additional
knowledge may increase quality in terms of recall and precision;
we have shown that connections between terms discovered using association
rules could improve retrieval performance in an IR system, without background
knowledge. But many considerations still need to be studied. The processes of
stemming and morphological analysis are very important. We are studying how
to automatically establish thresholds for the measures used and how to label
connections. Discovering connections between documents using association rules
appears worthwhile. We plan to introduce a graphical interface and, beyond that,
an additional linguistic process in order to deal with proper names and dates.
More than quantitative evaluation, we are interested in qualitative evaluation
of the extracted terms, connections between terms, and the organization of the
knowledge base.



Abstract. In this paper we present KnowCat, a non-supervised and distributed
system for structuring knowledge. KnowCat stands for “Knowledge
Catalyzer” and its purpose is enabling the crystallization of collective
knowledge as the result of user interactions. When new knowledge is
added to the Web, KnowCat assigns to it a low degree of crystallization;
the knowledge is fluid. When knowledge is used, it may achieve higher
or lower crystallization degrees, depending on the patterns of its usage.
If some piece of knowledge is not consulted or is judged to be poor by
the people who have consulted it, this knowledge will not achieve higher
crystallization degrees and will eventually disappear. If some piece of
knowledge is frequently consulted and is judged as appropriate by the
people who have consulted it, this knowledge will crystallize and will be
highlighted as more relevant.



1 What Kind Of Knowledge Are We Dealing With? 

Knowledge is a heavily overloaded word. It means different things to different
people. We therefore need to make clear what definition of
knowledge we are using in the context of KnowCat.

One possible definition of knowledge is given in the Collins dictionary: “the 
facts or experiences known by a person or group of people”. This definition, 
however, is too general and needs some refinement to become operational. In 
fact, we need some classification of the different types of knowledge that exist. 
Probably, each of them will require different treatments and different tools to be 
supported by a computer system. 

Quinn, Anderson, and Finkelstein [5] classify knowledge in four levels. These
levels are cognitive knowledge (know-what), advanced skills (know-how), systems
understanding (know-why), and self-motivated creativity (care-why). From this
point of view, KnowCat is a tool mainly involved with the first level of knowledge
(know-what) and secondarily related to the second level of knowledge (know-how).
KnowCat tries to build a repository (a group memory) that contains the
consensus on the know-what of the group (and perhaps some indications on the
know-how).

According to Polanyi [4], a meaningful classification of knowledge types
may be constructed in terms of its degree of “explicitness”: we may distinguish
between tacit knowledge and explicit knowledge. Tacit knowledge resides in people
and is difficult to formalize. Explicit knowledge, on the other hand, can be
transmitted from one person to another through documents, images and other
elements. Therefore, the possibility of formalization is the central attribute in
this classification. KnowCat deals with explicit knowledge: the atomic knowledge
elements are Web documents.

Allee [1] proposes a classification of knowledge in several levels: data, infor- 
mation, knowledge, meaning, philosophy, wisdom and union. Lower levels are 
more related to external data, while higher levels are more related to people, 
their beliefs and values. KnowCat deals with the lower half of these levels, al- 
though it may be used to support some of the higher levels. 

Finally, there is an active discussion on whether knowledge is an object or 
a process. Knowledge could be considered a union of the things that have been 
learned, so knowledge could be compared to an object. Or knowledge could be 
considered the process of sharing, creating, adapting, learning and communicat- 
ing. KnowCat deals with both aspects of knowledge, but considers that “knowl- 
edge as a process” is just a means for accomplishing the goal of crystallizing 
some “knowledge as an object”. 

From the above discussion, we can summarize the following characteristics
of the type of knowledge KnowCat deals with:



— KnowCat deals with “explicit” knowledge. 

— Knowledge consists of facts (know-what) and secondarily of processes
(know-how).

— These facts are produced as “knowledge-as-an-object” results by means of 
“knowledge-as-a-process” activities. 

— KnowCat deals with stable knowledge: the group memory is incrementally 
constructed and crystallized by use. The intrinsic dynamics of the underlying 
knowledge should be several orders of magnitude slower than the dynamics 
of the crystallization mechanisms. KnowCat knowledge may be compared 
to the knowledge currently stored in encyclopedias. The main difference of 
the results of KnowCat with respect to a specialized encyclopedia is the dis- 
tributed and non-supervised mechanism used to construct it. In particular, 
KnowCat allows the coexistence of knowledge in several grades of crystal- 
lization. 

— KnowCat knowledge is heavily structured: it takes the form of a tree. 

Getting into the details, KnowCat stores knowledge in the form of a knowledge
tree. Each node in the tree corresponds to a topic, and at any moment there
are several “articles” competing to be considered the established description
of the topic. Each topic is recursively partitioned into other topics. Knowledge
is contained in the description of the topics (associated with a crystallization
degree) and in the structure of the tree (the structure in itself contains explicit
knowledge about the relationships among topics).



2 Knowledge Crystallization

KnowCat is a groupware tool [2] with the main goal of enabling and encouraging
the crystallization of knowledge from the Web. But what do we mean by
crystallization? We conceptualize the Web as a repository where hundreds of people
publish, every day, pieces of information and knowledge that they believe to be
of interest for the broader community. This knowledge may have a short life
span (e.g. a weather forecast), or it may be useful for a longer time period. In
either case, recently published knowledge is in a “fluid” state: it may change or 
disappear very quickly. We would expect that, after some period of time, part 
of this knowledge ceases to be useful and is eliminated from the Web, part of 
this knowledge evolves and is converted into new pieces of knowledge, and part 
of this knowledge achieves stability and is recognized by the community as “es- 
tablished” knowledge. Unfortunately, established knowledge is not the typical 
case. 

KnowCat tries to enrich the Web by contributing the notion of a “crystalliza- 
tion degree” of knowledge. When new knowledge is added to the Web, KnowCat 
assigns to it a low degree of crystallization; i.e. new knowledge is fluid. When 
a piece of knowledge is used, it may achieve a higher or lower crystallization 
degree, depending on the patterns of its usage. If some piece of knowledge is 
not consulted or is judged to be poor by the people who have consulted it, this 
knowledge will not achieve a higher crystallization degree and will eventually 
disappear. If some piece of knowledge is frequently consulted and is judged as
appropriate by the people who have consulted it, this knowledge will crystallize;
it will be highlighted as relevant and it will not disappear easily. At any
moment there will be a mixture of fluid and crystallized pieces of knowledge
in the system, the former striving for recognition, the latter being available as
“established” knowledge.

Knowledge crystallization is a function of time, use, and opinions of users. 
If some piece of knowledge survives a long time, is broadly used, or receives 
favorable opinions from its users, then its crystallization degree is promoted. 
However, as important as the number of users or their opinions is the “quality” 
of these users. We would like to give more credibility to the opinions of experts 
than to the opinions of occasional users. But how can the system distinguish 
between experts and novices? 

KnowCat tries to establish such categories by the same means that the sci- 
entific community establishes its members’ credibility: taking into account past 
contributions. Only accredited “experts” in a given topic can vote for or against 
a new contribution. To get an “expert” credential at least one of a user’s contri- 
butions must be accepted by the already-established experts in the community. 
In a sense, this mechanism is similar to the peer review mechanism that is widely 
accepted in academia: senior scientists are requested to judge new contributions 
to the discipline, and the way for people to achieve seniority is to have their own 
contributions accepted. 

This mechanism is closely related to one of the central aspects of KnowCat:
the management of “virtual communities” [3]. But before discussing the details
of such management, we need to describe the KnowCat knowledge structure.
KnowCat is implemented as a Web server. Each KnowCat server represents a
main topic, which we call the root topic of the server. This root topic is the root
node of a knowledge tree. KnowCat maintains a meta-structure of knowledge
that is essentially a classification tree composed of nodes. Each node represents
a topic, and contains two items:

— A set of mutually alternative descriptions of the topic: a set of addresses of
Web documents with such descriptions.

— A refinement of the topic: a list of other KnowCat nodes. These nodes can 
be considered the subjects or refinement topics of the current topic. The list 
shapes the structure of the knowledge tree recursively. 

For example, we may have a root node with the topic “Uncertain reasoning”.
We may have four or five “articles” giving an introduction to this discipline.
Then we may have a list of four “subjects” for this topic: “Bayesian Networks”,
“Fuzzy Logic”, “Certainty Factors” and “Dempster-Shafer theory”. These subjects
in turn are nodes of the knowledge tree, with associated articles and further
refinements in terms of other subjects. Among the articles that describe the node
(topic), some are crystallized (with different degrees), others are fluid (just
arrived) and others are in the process of being discarded. All of them are in any
case competing with each other to be considered the “established” paper
on the topic.

Virtual communities of experts are constructed in terms of this knowledge 
tree. For each node (topic), the community of experts in this topic is composed 
of the authors of the crystallized papers on the topic, on the parent of the topic, 
on any of the children of the topic, or on any of the siblings of the topic. There is 
a virtual community for each node of the tree, and any successful author usually 
belongs to several related communities. 
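
The knowledge tree and the rule that defines a virtual community can be sketched in Python. This is an illustrative data model under stated assumptions, not the KnowCat implementation: the class names `Article` and `Node`, the function `expert_community`, the author names, and the example URLs are all hypothetical, and crystallization is reduced to a boolean flag.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Article:
    url: str              # address of the Web document describing the topic
    author: str
    crystallized: bool = False


@dataclass(eq=False)      # nodes are compared by identity, not by content
class Node:
    topic: str
    articles: List[Article] = field(default_factory=list)
    subjects: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add_subject(self, topic: str) -> "Node":
        child = Node(topic, parent=self)
        self.subjects.append(child)
        return child


def expert_community(node: Node) -> Set[str]:
    """Authors of crystallized articles on the node, its parent,
    its children, or its siblings."""
    related = [node] + node.subjects
    if node.parent is not None:
        related.append(node.parent)
        related += [s for s in node.parent.subjects if s is not node]
    return {a.author for n in related for a in n.articles if a.crystallized}


root = Node("Uncertain reasoning")
bayes = root.add_subject("Bayesian Networks")
fuzzy = root.add_subject("Fuzzy Logic")
root.articles.append(Article("http://example.org/intro", "ana", crystallized=True))
bayes.articles.append(Article("http://example.org/bn", "luis", crystallized=True))
bayes.articles.append(Article("http://example.org/bn2", "eva"))  # still fluid
fuzzy.articles.append(Article("http://example.org/fl", "marta", crystallized=True))

print(sorted(expert_community(bayes)))  # ['ana', 'luis', 'marta']
```

In this toy tree "eva" is not yet an expert anywhere, because her only contribution has not crystallized.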

The mechanism of knowledge crystallization is based on these virtual communities.
When one of your contributions crystallizes, you receive a certain number
of votes that you may apply to the crystallization of other articles (by other
authors) in the virtual community where your crystallized paper is located. In
turn, your contributions crystallize due to the support of the votes of other
members of the community. Although peer votes are the most important factor
for crystallization, other factors are also taken into account, such as the span of
time that the document has survived, and the number of “consultations” it has
received. If a document does not crystallize, after some time it is removed from
the node.
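
One periodic crystallization pass over the documents of a node might look like the sketch below. Everything numeric in it is an assumption made for illustration: the text gives no weights, thresholds, vote grants, or removal delays, so the constants `VOTES_GRANTED`, `THRESHOLD`, and `MAX_FLUID_DAYS`, the linear scoring function, and all names are hypothetical.

```python
from dataclasses import dataclass

VOTES_GRANTED = 3     # hypothetical: votes earned when a contribution crystallizes
THRESHOLD = 5.0       # hypothetical crystallization threshold
MAX_FLUID_DAYS = 90   # hypothetical: older fluid documents are removed


@dataclass
class Document:
    author: str
    votes: int = 0            # net peer votes (positive minus negative)
    consultations: int = 0
    age_days: int = 0
    crystallized: bool = False


def crystallization_degree(doc):
    # Peer votes dominate; consultations and survival time count less
    # (the weights are assumptions, not values from the paper).
    return 1.0 * doc.votes + 0.01 * doc.consultations + 0.01 * doc.age_days


def update_node(documents, vote_budget):
    """Promote, keep, or drop each document; credit votes to new experts."""
    kept = []
    for doc in documents:
        if not doc.crystallized and crystallization_degree(doc) >= THRESHOLD:
            doc.crystallized = True
            vote_budget[doc.author] = vote_budget.get(doc.author, 0) + VOTES_GRANTED
        if doc.crystallized or doc.age_days <= MAX_FLUID_DAYS:
            kept.append(doc)
    return kept


budget = {}
docs = [
    Document("ana", votes=5, consultations=40, age_days=30),
    Document("luis", votes=1, consultations=3, age_days=120),
    Document("eva", votes=2, consultations=10, age_days=10),
]
kept = update_node(docs, budget)
print([d.author for d in kept], budget)
```

In this run the first document crystallizes and its author receives votes to spend, the second is removed as an old fluid document, and the third stays fluid.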

The other aspect of knowledge crystallization is the evolution of a tree’s
structure. Any member of a virtual community may propose to add a new subject
to a topic, to remove a subject from a topic, or to move a subject from one topic
to another topic. Once proposed, a quorum of positive votes from other members
of the community is needed for the change to be accepted. All
community members may vote (positively or negatively) without spending any
of their acquired votes. KnowCat supports some elementary group communication
protocols to allow discussion of the adequacy of the proposed change, in case it
is needed.

The two mechanisms described above for crystallization of topic contents
and tree structure apply in the case of mature and active virtual communities.
However, virtual communities behave in a different way when they are just
beginning, and also (possibly) in their last days. KnowCat proposes a maturation
process that involves several phases. Rules for crystallization (and other rules)
may be different for each phase. Figure 1 shows this evolution.



[Diagram: life cycle of a node and its virtual community]

- NON-EXISTENT NODE to SUPERVISED NODE: a new node is created as a root node or
  as a "subject" (child) of another node. A member of the community may suggest
  the creation of a new node; it has to be approved by a majority of the members.

- SUPERVISED NODE: at this stage there will be a "steering committee" which will
  decide on the way knowledge is structured, receive users' opinions, and settle
  all matters. This is a supervised stage.

- SUPERVISED NODE to ACTIVE NODE: the steering committee may decide to promote
  the node to the Active stage.

- ACTIVE NODE: an active node shows a lot of activity in the contents of the
  node. However, the structure (list of refinement subjects) is much more
  stable. There is no steering committee.

- ACTIVE NODE to SUPERVISED NODE: the community of an active node may decide to
  return to a supervised stage to engage in a process of restructuring.

- ACTIVE NODE to STABLE NODE: after some time, many of the members of the
  original community cease to be active. Changes and new contributions are rare.

- STABLE NODE: there are very few changes; the node is very stable.

- STABLE NODE to ACTIVE NODE: the contribution rate increases. There are many
  active community members again.

Fig. 1. Virtual community modes

Normal crystallization mechanisms apply to active nodes (and associated
virtual communities). However, these mechanisms require a minimum amount
of activity in the node to obtain reasonable results. This minimum might not be
achievable at the beginning or at the end of a node’s life cycle.

When a node is created (especially when it is created as the root node of
a new KnowCat server), there may not be many accredited experts to form
the virtual community. For some time (which, in the case of new nodes hanging
from established knowledge trees, may be reduced to zero) the node will work
in the supervised mode. During this supervised phase there will be a steering
committee in charge of many of the decisions that will be made in a distributed
way in later phases. In particular, all members of the steering committee are
considered experts on the node and have an unlimited number of votes to spend on
accepting or rejecting contributions. Radical changes in the structure of subjects
are possible by consensus of the steering committee. In addition, one important
task of the steering committee is to motivate the community of a new node to
participate and to grow large enough to become an active community. Members
of the steering committee are defined when a new node is created. New members
can be added by consensus of current members.

A steering committee may decide to advance a node to active status. At this 
point the committee is dissolved and crystallization of the knowledge is carried 
out as explained before. In exceptional cases, a community of experts may decide 
(and vote) to return to supervised mode. This probably will be motivated by the 
need to make radical changes in the structure of subjects. Active communities 
can only change subjects by adding or deleting subjects one at a time. Supervised 
communities may engage in more complex and global structural changes. 

Finally, an active community may reach the stable phase. Many of the community
members are no longer active, so different rules should be applied to
ensure some continuity of the crystallization. Changes are rare, and most of the
activity is consultation. Few new contributions arrive, and they will find it much
harder to crystallize than in the active phase. However, if activity rises
above a minimum, the node may switch to active status again and engage in a new
crystallization phase.
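
The phases and the transitions between them can be summarized as a small state machine. The sketch below is an illustration only: the event names (`create`, `promote`, `restructure`, `activity_drops`, `activity_rises`) are hypothetical labels for the transitions described above, and the conditions that trigger them (committee decision, community vote, activity level) are not modeled.

```python
from enum import Enum


class Phase(Enum):
    NON_EXISTENT = "non-existent"
    SUPERVISED = "supervised"
    ACTIVE = "active"
    STABLE = "stable"


# Allowed transitions, keyed by (current phase, event)
TRANSITIONS = {
    (Phase.NON_EXISTENT, "create"): Phase.SUPERVISED,
    (Phase.SUPERVISED, "promote"): Phase.ACTIVE,        # steering committee decision
    (Phase.ACTIVE, "restructure"): Phase.SUPERVISED,    # voted by the community
    (Phase.ACTIVE, "activity_drops"): Phase.STABLE,
    (Phase.STABLE, "activity_rises"): Phase.ACTIVE,
}


def next_phase(phase, event):
    """Return the phase reached from `phase` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(phase, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in phase {phase.value}") from None


phase = Phase.NON_EXISTENT
for event in ("create", "promote", "activity_drops", "activity_rises"):
    phase = next_phase(phase, event)
print(phase)  # Phase.ACTIVE
```

Keeping the transitions in a table makes it easy to attach different crystallization rules to each phase later.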

3 Implementation 

We currently have an operational prototype of the KnowCat system, which is a
Web-based client-server application. Each KnowCat server can be considered to
be the root of a tree structure representing knowledge. This structure and all its
associated data (contributions, authors, votes, etc.) are stored in a database.

Knowledge contributions (documents) are not a part of KnowCat. Generat- 
ing new knowledge (for example, a paper about “Security in Windows NT” ) is 
a process that is not managed by the tool. KnowCat presumes that any con- 
tribution is a WWW (World Wide Web) document located somewhere in the 
Web, accessible via a standard URL. The KnowCat knowledge tree stores URL 
references to such documents. 

Users connecting to a KnowCat node can choose between several operations:
adding a document, voting on documents, proposing subjects of a topic, and
voting on subjects. KnowCat servers build Web pages in a quasi-dynamic way.
When an operation is performed, some data is added to database tables but new
crystallization values of system elements are not calculated at this time.
Crystallization is performed periodically and asynchronously. KnowCat servers build
Web pages dynamically, using information stored in the database.

We now present some of the screens a user would see while working with 
KnowCat. Figure 2 shows a sample screen shot. 



[Screenshot: browser window on the “Uncertain Reasoning” root node at
http://knowcat.ii.uam.es/reasoning/, with a list of three contributed documents
(author and timestamp) on the left and, on the right, the refinement subjects
“Bayesian Networks”, “Certainty Factors”, “Dempster-Shafer Theory” and
“Methods based on Fuzzy Sets Theory”.]


Fig. 2. Root node of a KnowCat knowledge tree 



Figure 2 shows the root node of a knowledge tree that was produced in
one of our experiments (see Section 4). The screen is divided into two parts. The
left side shows documents that have been submitted on the topic of “Uncertain
Reasoning”. An author name and timestamp identify these contributions. The
first two documents in Figure 2 have already crystallized, but the third document
is still fluid. Crystallized documents are ordered by crystallization grade.

To visit a document (for example “Rosa Martin Salas [3/23/99 - 5:59:41 
PM]”), the user clicks on its link and the corresponding document content (lo- 
cated somewhere in the Web) is displayed. 

The right side of the screen in Figure 2 shows the refinement of the “Uncertain
Reasoning” topic. Below the arrow (“Next Subjects”) are subjects that refine
the current topic. An associated knowledge node may be visited by clicking on
the desired subject. For example, if we select the subject “Methods based on
Fuzzy Sets Theory”, the screen shown in Figure 3 would be displayed.

In the lower-right corner of Figure 3 we see that a “Fuzzy Measures” subject
has been proposed for addition to the refinement list of the current node.
By selecting the option “TO VOTE... ADD SUBJECT”, any authorized user
(a member of the steering committee of a supervised node or a member of the
virtual community otherwise) may vote on whether this proposed modification is
accepted. If someone proposes to remove a subject, it will appear below the
previous section under the heading “Subjects Proposed to be Removed.”



[Screenshot: browser window on the “Methods based on Fuzzy Sets Theory” node,
with three contributed documents (author and timestamp), the parent topic
“Uncertain Reasoning”, the subjects “Fuzzy Logic” and “Fuzzy Sets”, and
“Fuzzy Measures” listed under “Subjects Proposed to be Added”.]


Fig. 3. Refinement node visited from the root node 



In any node, authorized users may use the option “VOTE DOCUMENT”
to vote (positively or negatively) on one of the documents of the node. Any user
may contribute to the system by means of the “ADD DOCUMENT” option.



4 Experimental Results 

We have performed three experiments with KnowCat. The first experiment was
done with third-year students enrolled in an advanced operating systems course
in the Computer Science Department of the Universidad Autónoma de Madrid.

At the beginning, a KnowCat node was created on the topic “Operating
Systems.” An initial structure of six subjects was devised. We wanted to check
the “contents” part of the tool. Collaboration was voluntary, and only 15 of
the 80 students enrolled in the course participated. Of these 15, 7 contributed
regularly.

Students performed several operations: 

— 27% of operations were adding documents to the system, 

— 23% were voting on documents, 

— 14% were proposing initial-topic subjects, and 

— 36% were voting on proposed refinements. 

This experiment served as a test bed for some of our initial ideas, and sev- 
eral of the developments presented in this paper were devised after our initial 
findings. Not surprisingly, one of the major problems we faced was to motivate 
the participation of the students (contributing and voting). The results of the 
interaction were of moderate quality in this first experiment. 

The second experiment involved students of a graduate Computer Science 
course on “Uncertain Reasoning” at our university. For this experiment we 
wanted to test the “structure” part of KnowCat, so we created a root node 
(“Uncertain Reasoning”) with no initial topics. This course is delivered every 
year and we will continue this experiment for several more years. About 10 to 
15 graduate students enroll each year. The aim of the experiment during the 
first year (this year) was to check the feasibility of the students making a good 
structure for the topic by using our proposed voting mechanism. 

At this point we can summarize the experiment progress: 

— Most of the operations performed by students were related to the refinement
of topics: 41% of all operations consisted of proposing new subjects; 39%
were student opinions on new subjects; only 13% consisted of adding documents
to the system; and the remaining 7% consisted of votes on documents.

— There are 14 topics in the KnowCat node and they are distributed over a 
tree four levels deep. So the structure of the node has evolved very well. The 
course instructor judges the quality of the structure to be satisfactory. He 
did not interfere directly in the process of defining the structure. 

— Student participation has been fairly uniformly distributed in time. However, 
there were several periods during which student participation was markedly 
greater. These coincided with student communication by e-mail. For exam- 
ple, when students noticed that they should discuss the correct location of 
a refinement, they used e-mail communication for the discussion. 

— During the quarter (3 months) each student participated 10 times on average,
achieving a much higher degree of participation than in the previous
experiment. The graduate status of the students in the second experiment,
as well as the more controversial nature of the process of deciding the most
adequate structure of a topic, are likely reasons for the higher participation.

— Students spent a lot of time consulting documents contributed by other
students. This suggests that the system is useful as a repository of
structured knowledge.

Another interesting design aspect has arisen as a result of this experiment.
In the current version of KnowCat, when a user proposes a new subject for a
topic s/he only has to specify the new topic’s name. Sometimes the name alone
is not enough to understand the intent of the proposal, and therefore
discussion of the adequacy of the proposal is not easy. This issue will be solved
in future versions of the tool by allowing the inclusion of comments with the
initial proposal and subsequent votes.

Finally, the third experiment involved students at our university registered in
a preliminary operating systems course. This experiment started with the root
node created in the first experiment. The 200 students (two concurrent classes)
were assigned to 12 predefined subjects (roughly 16 students per topic). Each
student was asked to produce a small paper on the assigned topic and to vote for
the three best papers on that same topic. We wanted to check the hypothesis that when
you get enough contributions and enough votes from “knowledgeable” peers,
the result is a reasonable description of the topic. The instructor graded papers
independently, and this grading was used to check the adequacy of the voting
system to capture the quality of the papers. In this case student motivation was
achieved by grading the students on the quality of their papers and on the quality
of their votes (that is, on their ability to judge the topic, as compared
with the opinion of the teacher).

The results of this experiment were very encouraging. In 11 out of the 12 
topics the votes of the students converged to a small set of papers. There was 
a remarkable consensus. For most topics the two most popular papers collected 
50% of the total votes (the maximum was 66% since each student issued three 
votes for three different papers). Also, the four most popular papers always 
collected 75% of the total votes. Only in one topic was there a dispersion in 
the votes, probably due to a more uniform distribution of the quality of papers 
submitted. 
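
The 66% ceiling follows from the voting rule: if n students each cast three votes for three different papers, there are 3n votes in total and no single paper can receive more than n of them, so any two papers together hold at most 2n/3n, about 66.7%. The following Python sketch computes the share held by the k most popular papers and the corresponding ceiling. The function names and the sample ballots are illustrative assumptions, not KnowCat code or data from the experiment.

```python
from collections import Counter


def top_k_share(ballots, k):
    """Fraction of all votes collected by the k most-voted papers.

    `ballots` is an iterable of ballots; each ballot lists the distinct
    papers one student voted for.
    """
    counts = Counter(paper for ballot in ballots for paper in ballot)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


def max_top_k_share(k, votes_per_student=3):
    """Upper bound on top_k_share when every student casts
    `votes_per_student` votes for distinct papers."""
    return min(1.0, k / votes_per_student)


# Hypothetical ballots: three students who all agree on papers "a" and "b".
sample = [("a", "b", "c"), ("a", "b", "d"), ("a", "b", "e")]
print(top_k_share(sample, 2), max_top_k_share(2))
```

In the hypothetical sample every student agrees on the same two papers, so the two most popular papers reach the ceiling exactly. The 50% reported for most topics should be read against this 66.7% maximum rather than against 100%.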

Furthermore, in 10 out of the 12 topics at least two of the three papers 
selected by the instructor as “the three best papers” were also selected by the 
students as such (in 3 cases the two sets of “the three best papers” were identical). 
In 7 topics the best paper, in the opinion of the instructor, was selected by the 
students as one of the three best. There was one topic where the opinion of the 
instructor was very different from the opinion of the students, but this case was 
also the one where the votes were most dispersed: there were no clear “winners.” 
Although these results have to be verified with further experiments (next 
year this experiment will be repeated with other students), we think that they 
already provide some support for two of the hypotheses that underlie the design 
of KnowCat: 

— If a set of “knowledgeable” people engage in a reasonable interaction with 
our system, the result converges to some consensus. 

— This consensus is closely related to some objective measure of “quality” of 
the contributions. 

Of course, these results have to be weighed in the light of our experimental
setting, which has possibly introduced some artifacts:
— Motivation aspects were intentionally omitted from the experiment. Grading
policies furnished sufficient external motivation to the students. In particular 
they were encouraged to vote after a careful reflection on the quality of all 
the papers. It is not clear whether the obtained consensus would have been 
achieved without this external motivation. 

— The group of “knowledgeable” people was very special in nature. All were 
novices in the topic at the beginning of the course, and they acquired their 
expertise by attending the same lectures and by reading the same reasonably 
sized but finite bibliography. 

— The “objective” measurement of the quality of the papers was done by the 
same person that had lectured them during the course. 

During the next year we are going to engage in further experiments trying 
to correct these artifacts by changing the experiment conditions. 



5 Conclusions And Future Work 

In this paper we present a new tool for structuring knowledge in the Web based 
on the concept of “crystallization.” A first prototype of the tool is currently in 
operation. Some of the findings in our initial experiments have been included as 
new features of the tool. However, we have identified several areas where further 
improvement and experimentation are needed. 

KnowCat will be a distributed and scalable system. It should work with 
stand-alone servers (“knowledge islands”) and also with combinations of these 
“islands” into higher level structures. Inter-server protocols must be developed 
for this purpose. Several new problems appear in this scenario, such as the 
possible duplication of nodes (perhaps with slightly different names), problems 
with the ownership of the joint structure, and difficulties in structure changes 
that affect two servers simultaneously. 

We currently propose a tree structure for the knowledge tree. However, it may 
be the case that a subject is the refinement of more than one topic. Perhaps 
the tree structure should be transformed into a generic graph structure. The 
compromise is between efficiency, clarity, and maintainability (tree structure) 
and expressiveness (graph structure). 

There are several open questions with respect to the crystallization mechanism.
For example, some enforcement of the “fairness” of the voting mechanism is 
necessary. The votes related to such crystallization procedures will be public. 
Anyone will be able to research the voting structure and locate recurrent “voting 
cycles” or other artifacts. The dynamics of the voting mechanism also presents 
several open research issues. Should the system be inflationary on the number of 
votes available? When an article de-crystallizes, what should be done with the
votes its author obtained through that article?

At the moment we have only tested KnowCat on university courses, and we
have noticed that this system is really useful for motivating students to share
their knowledge and incrementally construct a knowledge repository that will
improve over the years. However, we expect our tool to show some of its more
innovative characteristics in the environment of research groups interested in 
sharing knowledge. During the next year several experiments will be carried out 
in this direction. 



Acknowledgements. This work has been partially supported by the
Comunidad Autónoma de Madrid (project number 07T/0027/1998) and by CICYT
(project number TIC98-0247-C02-02).



References 

1. Allee, V.: The Knowledge Evolution. Butterworth Heinemann, Boston, (1997) 

2. Coleman, D., Bock, G.: Groupware: Technology and Applications. Prentice Hall, 
Upper Saddle River, NJ (1997) 

3. Hill, W., Stead, L., Rosenstein, M. Purneis, G.: Recommending and Evaluating 
Choices in a Virtual Community of Use. Proc. CHI95, ACM Press, New York, 
(1995) 194-201. 

4. Polanyi, M.: The Tacit Dimension. Routledge and Kegan Paul, London (1966) 

5. Quinn, J. B., Andersen, P., Findelstein, S.: Managing Professional Intellect: Making 
the Most of the Best. Harvard Business Review, 74(1996) 198-205 




Meta-modeling for Web-Based Teachware 
Management 



Christian Süß¹, Burkhard Freitag¹, and Peter Brössler²

¹ Fakultät für Mathematik und Informatik,
Universität Passau, D-94030 Passau, Germany,
{suess,freitag}@fmi.uni-passau.de,
WWW home page: http://daisy.fmi.uni-passau.de/

² software design & management,
D-81737 München, Germany,
peter.broessler@sdm.de,
WWW home page: http://www.sdm.de/



Abstract. In this paper we propose a meta-modeling approach to adaptive
hypermedia-based electronic teachware that focuses on document
structures and navigational services and which is also applicable to
knowledge management.

An abstract meta-model is presented which is suitable to describe
heterogeneous and semi-structured course material from different domains
of application on the web. As an instance of this generic framework we
derive a sample model for the domain of teaching computer science.
Content identification and querying at the meta-level and the use of
metadata enhance navigation and facilitate adaptive presentation and
navigation as well as reuse and adaptation of existing material to new
audiences. Each model can serve as a well defined basis for a corresponding
XML based learning material markup language (LM²L) representation
which can be restructured and rendered by XSL style sheets for different
audiences, layouts, or platforms in web based teaching.



1 Introduction 

Today, teachware (in this paper also called learning, teaching or course material) 
is frequently provided electronically on the Web in a variety of formats, e.g. as 
a collection of HTML pages, as PDF documents, as MS¹ Word documents, or
as MS PowerPoint presentations, that can essentially only be accessed using the 
corresponding viewers or readers and thus is more or less unstructured. In most 
cases, meta information which could be exploited by a knowledge or teachware 
management system is either entirely missing or individually assigned in a rather 
ad hoc way and only at the coarse granularity of large units. 

¹ Microsoft is a trademark of Microsoft Corporation.

Conceptual and navigational modeling of teachware faces interesting
challenges: On the one hand, learners may vary in their knowledge and interest and
are seeking individual access to course material by adaptable navigation and
presentation of given contents and structures [2]. On the other hand, it is
essential for authors to find and combine existing learning material, adapting it to
new audiences, or to share learning material.

Students taking a database course, for example, may want to get a first 
understanding of B-tree indexes without going very deep into the details. A 
typical question might therefore be: “Which definitions and formal results are 
needed for undergraduate students of computer science to understand B-trees 
from an application point of view?” Similarly, authors who are about to write 
a new course module about this topic and want to reuse existing material may 
ask for suitable figures and animations illustrating the formal material. 

As the example above indicates, both types of users are looking for conve- 
nient navigation capabilities as well as for support of complex queries. In addi- 
tion, authors should be supported by (re)structuring mechanisms, source (code) 
control, and version control. These similarities between software engineering and 
the development of learning material call for support of the life-cycle of learning 
material by a sound methodology, in particular with respect to the conceptual 
modeling. 

In this paper we propose a meta-modeling approach to adaptive
hypermedia-based electronic teaching material² that makes it possible to describe
knowledge about aspects of the contents of course material as well as
navigational aspects.

In our approach each teaching domain has an underlying model that describes 
the navigational and the conceptual content structure used for this domain. We 
propose a sample model for the domain of teaching mathematics or computer 
science in which the latter could be given as a sequence of definitions, theorems, 
and examples. Other domains, such as teaching a foreign language, may require 
a different content structure which we will specify in corresponding models. At 
this point it should be noted that in our approach the document structure is
modeled rather than the conceptual structure of the subject or area the docu- 
ments are about. This is far more than most of the existing electronic teaching 
material provides but still avoids the huge effort needed to conceptually model 
an entire topic itself. 

To support navigation and retrieval as well as reuse and adaptation, the 
model defines domain-specific properties, i.e. groups of metadata, for (collec- 
tions of) documents, conceptual units and conceptual relationships within. The 
conceptual relationships represented in the model are used as an ancillary struc- 
ture [9] helping to navigate through the underlying material or to ask questions 
like “Which definitions are prerequisite to a given proposition?” thus giving the 
required support for adaptable navigation. Given a model which defines suitable 
conceptual units and their properties, it is possible to integrate material with a 
granularity ranging from single words to entire courses. 

² It should be mentioned that, despite our focus on electronic teaching throughout
this paper, the techniques described are equally applicable to knowledge management
in general.

Furthermore, each model serves as a well defined basis for a corresponding
XML based markup language, e.g. for the learning material markup language
(LM²L) [13] which we use to represent university courses in computer science.
The metadata are used by XSL style sheets for restructuring and rendering the
corresponding XML representation for different audiences, layouts, or platforms
in web based teaching.

Our models serve as sophisticated a-posteriori data schemata that make it
possible to apply database technology to web based learning and teaching, in
particular to improve access to documents.

As a common basis for domain-specific models we present an abstract
meta-model which is independent of the underlying domain of application and provides
a suitable platform to describe heterogeneous and semi-structured course
material from different domains of application on the web. The meta-model describes
what it means to be a model, i.e. gives a definition of the general kind of
structure description that is accepted and can be understood. This way we obtain
an extensible generic framework which is easy to modify and extend: Teaching
environments can be extended by new models for teaching areas as different as
database theory or the world of opera. Furthermore, existing models can be
easily extended to meet new requirements, e.g. by adding a video clip (see section
3.3). Both are important features for the rapidly evolving domain of web based
teaching.

The rest of the paper is organized as follows: In section 2, we discuss
common as well as domain-specific properties of teaching material and list the
requirements for teachware management systems. In section 3, a meta-modeling
approach to teachware is proposed: We illustrate the different levels of our
modeling architecture and present an abstract extensible generic meta-model. Using
a sample instantiation, a model of the domain of teaching and learning computer
science, we show the benefits of our approach to the management of teachware.
In section 4 we introduce LM²L, the Learning Material Markup Language, and
sketch the use of XSL to adapt learning material to different users. In section 5
we discuss related work. The paper is concluded with a summary in section 6.

2 Teaching Material: Properties and Requirements 

2.1 Properties of Teaching Material 

General Properties Semantically, teaching material consists of different con- 
ceptual units. There are course units but also training units as well as annota- 
tional units, etc., with the following properties: 

Conceptual units exist at various levels of granularity and have an inner struc- 
ture ranging from semantically unspecified floating text to semantical content 
objects. Conceptual units as well as the content objects within have relationships 
to other objects and units. These relationships are of different semantical types 
and are not bound to the same kind of conceptual unit. 

Looking at teaching material, we have to consider navigational aspects con- 
cerning access to and navigation in the given conceptual units, as well: 

Conceptual units can be presented to the user in various ways, e.g. as a
(sequence of) HTML page(s) or pages in a PDF document. There are different types
of access and navigation: Teaching material usually is accessed and navigated
hierarchically via a root node or sequentially starting at a first node. However,
there are also indexes, glossaries, and the like that allow direct, non-hierarchical
access. Conceptual relationships usually can be presented as hyperlinks,
providing a specific operational behaviour.

Finally, additional information (metadata) is assigned not only to
(sequences, hierarchies, etc. of) entire conceptual units but also to (some of) the
content objects and relationships they contain.

Domain-Specific Properties Whereas the properties mentioned above seem
to be common to teaching material of different domains of application, there 
are domain-specific kinds of conceptual units, content objects and relationships 
as well as navigational units and access structures. Domain-specific types of 
hyperlinks show different kinds of operational behaviour and there exist domain- 
specific kinds of metadata. 

2.2 Requirements to Teachware Management Systems 

Teachware management systems³ address two different groups of users: authors
and learners. Both of them need support for thematic or type-specific filtering
to provide individual access. They are also looking for adaptable navigation and
presentation of given contents and structures, not only at the coarse granularity
of entire courses or chapters, but rather at the fine granularity of content objects.
Furthermore, convenient navigation using ancillary structures and sophisticated
querying capabilities are important as well.

In addition, authoring tools are necessary for creating new material and
integrating existing material, which is often heterogeneous and only semi-structured.
Finding 'similar' material, reuse, (re-)combination, adaptation, and sharing of
existing material as well as controlling sources and versions should be supported,
too.

In general, a teachware management system should be able to manage teaching
material from different domains of application using database technology, which
turns out to improve especially the access to huge amounts of documents [1].

3 A Meta-Modeling Approach to Teaching Material 

3.1 Modeling at Different Levels 

The properties and requirements described in the previous section emphasize the
need for a common content and navigation structure of course material while at
the same time they call for domain-specific instantiations. This suggests using
a meta-modeling approach to hypermedia-based teaching material rather than
focusing on a single model of course material.

³ The mentioned requirements are not only applicable to teachware but also to
knowledge management systems in general.

Figure 1 shows the different levels of our meta-modeling architecture:

Fig. 1. Modeling at different levels



1. At the bottom layer, the real world consists of the subjects to be taught or
learned, or, in general, the knowledge to be managed. Two (obviously quite
heterogeneous) sample domains we refer to in this paper are database theory
and the world of opera.

2. At the second layer, hypermedia teachware or, in general, hypermedia
documents, including access and navigation aspects, describe the given
domain of application in an appropriate way using the domain-specific means
of section 2.1.

3. In domain-specific models, we describe the content and navigation struc- 
ture of course material or, in general, of hyperdocuments, of a given domain 
of application. Rather than modeling the topic itself - which is left to the 
content of the teaching documents - at this layer schemata are defined that 
control the admissible form of the documents. 

4. Finally, the common abstract meta-model describes what it means to be a 
domain-specific model, i.e. gives a definition of the general kind of structure 
description that is accepted and can be understood. 

3.2 The Abstract Meta-Model for Teaching Material 

The properties found to be common to teaching material of different domains of 
application (cf. section 2.1) are realized in a straightforward way in the abstract 
meta-model, which we use to describe teaching material in general. 

The graphical Unified Modeling Language (UML) [5,8] is used to visualize
our models. Note that within our models, the concepts can be organized
in a concept hierarchy with single inheritance relations among concepts. This
generalization within the models is not to be confused with the generalization
between the meta-model and a domain-specific model! In addition, concepts in
both models which generalize other concepts are called abstract and cannot be
instantiated. In particular, all concepts in the abstract meta-model are abstract.




Fig. 2. Abstract meta-model of teachware



On the right hand side of figure 2, the abstract conceptual content
structure of the course material is specified, i.e. which concepts model the content
of teaching material. The conceptual units the teaching material consists of
are generalized to the abstract concept «ConceptualUnit». Their variable
inner structure is realized by «ContentObjects». Both are, at the meta-model
level, instances of the general concept «Resource», thus allowing all kinds of
relationships, described by «Relationship», to hold for «ConceptualUnits» and
«ContentObjects» as well. To allow the assignment of different types of
metadata to «ConceptualUnits», «ContentObjects» as well as «Relationships», we
use the concept «ConceptualProperties».

On the left hand side of figure 2, the abstract navigational structure of
teaching material is specified, reflecting the properties mentioned in section 2.1.
«ConceptualUnits» are presented to the user by different kinds of
«NavigationalUnits» which are the terminal nodes of a polyhierarchical hypermedia
structure similar to directories or books, but allowing nodes to belong to more
than one supernode. Indeed, each «Node» has to be contained in at least one
«StructureNode» allowing e.g. hierarchical as well as sequential navigation (as
illustrated in section 3.3). The meta-model also specifies which «Relationships»
are relevant to navigation and therefore are presented by «Links» which can be
instantiated with different kinds of operational behaviour. To allow the
assignment of different types of metadata to «NavigationalUnits», «StructureNodes»
as well as «Links», we use the concept «NavigationalProperties».
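
The structure described for figure 2 can be approximated in code. The following Python sketch is only one possible reading of the meta-model, not the authors' implementation: the base class MetaConcept, the `abstract` keyword, and the class name LMMLDocument are assumptions made for illustration, and properties are reduced to plain keyword arguments. The rule that each «Node» must belong to at least one «StructureNode» is not enforced here.

```python
class MetaConcept:
    """Root of the sketch; concepts declared abstract cannot be instantiated."""
    _abstract = True

    def __init_subclass__(cls, abstract=False, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._abstract = abstract

    def __new__(cls, *args, **kwargs):
        if cls._abstract:
            raise TypeError(f"{cls.__name__} is abstract")
        return super().__new__(cls)


# Conceptual content structure (right hand side of figure 2).
class Resource(MetaConcept, abstract=True):
    def __init__(self, **properties):
        self.properties = properties  # stands in for ConceptualProperties


class ContentObject(Resource, abstract=True):
    pass


class ConceptualUnit(Resource, abstract=True):
    def __init__(self, contents=(), **properties):
        super().__init__(**properties)
        self.contents = list(contents)  # the unit's inner structure


# Navigational structure (left hand side of figure 2).
class Node(MetaConcept, abstract=True):
    def __init__(self, **properties):
        self.properties = properties  # stands in for NavigationalProperties
        self.parents = []  # polyhierarchy: more than one supernode allowed


class NavigationalUnit(Node, abstract=True):
    def __init__(self, presents, **properties):
        super().__init__(**properties)
        self.presents = presents  # the ConceptualUnit shown to the user


class StructureNode(Node, abstract=True):
    def __init__(self, **properties):
        super().__init__(**properties)
        self.children = []

    def add(self, node):
        self.children.append(node)
        node.parents.append(self)


# Domain-specific concepts (concrete, hence instantiable).
class Definition(ContentObject):
    pass


class CourseUnit(ConceptualUnit):
    pass


class LMMLDocument(NavigationalUnit):  # hypothetical name
    pass


class GuidedTour(StructureNode):
    pass


unit = CourseUnit([Definition(title="B-Tree", difficulty="low")], title="B-tree")
doc = LMMLDocument(unit)
tour_a, tour_b = GuidedTour(), GuidedTour()
tour_a.add(doc)
tour_b.add(doc)  # the same document belongs to two tours
```

Calling `Resource()` or `Node()` directly raises a TypeError, mirroring the statement that all concepts of the abstract meta-model are abstract.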
3.3 Supporting Designers, Authors and Learners

This section discusses how designers of domain-specific models, authors, and 
learners can be supported e.g. by the exploitation of domain-specific inner struc- 
ture as well as of metadata by a teachware management system. 



Sample Domain-Specific Model As a sample instantiation of our
meta-model, figure 3 shows a model⁴ of course material on the domain of learning and
teaching computer science which is based on real life course material.




Fig. 3. Domain-specific model for computer science material



A domain-specific model defines which instances of «ContentObject» can be
present in which instances of «ConceptualUnit». In this model there is only
one conceptual unit, namely CourseUnit, which can contain the following content
objects: floating text, e.g. PlainTexts, UnorderedLists, OrderedLists, and Tables,
or specified objects, e.g. Definition, Proposition, Proof, Algorithm, etc. Specified
objects can include floating text, e.g. a Definition can include an OrderedList,
restricted by certain constraints such as: a Proof must not contain another Proof.

⁴ Note that we have omitted names, roles, and multiplicities to increase readability.

Also instances of Relationships as well as their possible source and target
objects are specified: Some of the content objects are connected by different
instances of the conceptual relationship refers-to. For example, in the course
material instances of all objects can be illustrated by a figure via illustrates
and there are relations like Proof proves Proposition. On the one hand, as the
relationship proves inherits from the relationship refers-to, the set of all instances
of the proves relationship object is a subset of the set of all instances of refers-to.
On the other hand, the relationship proves has as source the object Proof and
as possible target the object Proposition. As inheritance between relationship
objects also restricts the target and source objects of the inheriting relationship
objects, the proves relationship can only start and end at objects corresponding
to the definition of refers-to or subtypes thereof.
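
The two properties of relationship inheritance described above (every proves instance is also a refers-to instance, and an inheriting relationship may only narrow the admissible source and target types) can be illustrated with a small Python sketch. The class layout and the attribute names source_type and target_type are assumptions chosen for this illustration; the paper does not prescribe an implementation.

```python
class ContentObject:
    """Stand-in for any content object of the sample model."""


class Definition(ContentObject):
    pass


class Proposition(ContentObject):
    pass


class Proof(ContentObject):
    pass


class RefersTo:
    """General conceptual relationship between two content objects."""
    source_type = ContentObject
    target_type = ContentObject

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        parent = cls.__mro__[1]  # direct parent under single inheritance
        # An inheriting relationship may only narrow source and target.
        if not (issubclass(cls.source_type, parent.source_type)
                and issubclass(cls.target_type, parent.target_type)):
            raise TypeError(f"{cls.__name__} widens its parent's source or target")

    def __init__(self, source, target):
        if not isinstance(source, self.source_type):
            raise TypeError(f"source must be a {self.source_type.__name__}")
        if not isinstance(target, self.target_type):
            raise TypeError(f"target must be a {self.target_type.__name__}")
        self.source, self.target = source, target


class Proves(RefersTo):
    """proves inherits from refers-to: a Proof proves a Proposition."""
    source_type = Proof
    target_type = Proposition


link = Proves(Proof(), Proposition())
print(isinstance(link, RefersTo))  # every proves instance is a refers-to instance
```

Attempting `Proves(Definition(), Proposition())` raises a TypeError in this sketch, because a Definition is not an admissible source of proves.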

In our sample model there are three kinds of properties, i.e. groups of
metadata, describing CourseUnits, ContentObjects and Relationships. For the sake of
simplicity, in our example, those properties are vectors of attribute/value pairs,
where attributes are named properties of objects, and values are atomic (text
strings, numbers, etc.). The properties in our example describe general aspects
(author, date, discipline, and language), content aspects (title and topics) and
pedagogical aspects (difficulty).

Finally, looking at the navigational aspect in our example, a CourseUnit
on computer science (instance of «ConceptualUnit») is presented by an LM²L
document (see section 4 for details), which is an instance of «NavigationalUnit».
These documents are grouped into sequences (GuidedTours) satisfying the partial
ordering imposed by the two OrderingProperties prerequisites and objectives,
which have text strings as values.

By presenting different types of relationships in the content model by
different types of links in the navigational model, the navigational behaviour of
relationships is separated from their conceptual meaning. It is even possible, e.g.,
to specify an is-prerequisite relationship which is not navigable at all but can
be used to compute the prerequisites of a certain learning goal, which then are
linearized and grouped into a GuidedTour to reduce disorientation of students.
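
Computing the prerequisites of a learning goal and linearizing them amounts to collecting the goal's transitive prerequisites and sorting them topologically. The Python sketch below shows this under stated assumptions: the function name guided_tour, the dictionary representation of the is-prerequisite relationship, and the sample units are hypothetical, and the standard-library graphlib module (Python 3.9 or later) is used in place of whatever mechanism a teachware management system would provide.

```python
from graphlib import TopologicalSorter


def guided_tour(goal, prerequisites):
    """Linearize `goal` and all of its transitive prerequisites.

    `prerequisites` maps a unit to the set of units that are direct
    prerequisites of it (the is-prerequisite relationship).
    Raises graphlib.CycleError if the relationship is cyclic.
    """
    needed = {}
    stack = [goal]
    while stack:  # collect only what the goal actually depends on
        unit = stack.pop()
        if unit in needed:
            continue
        needed[unit] = set(prerequisites.get(unit, ()))
        stack.extend(needed[unit])
    # static_order() yields every unit after all of its prerequisites.
    return list(TopologicalSorter(needed).static_order())


# Hypothetical is-prerequisite data for a database course.
prereqs = {
    "B-tree": {"search key", "tree"},
    "search key": {"record"},
    "hashing": {"record"},  # not needed for the goal, so left out of the tour
}
tour = guided_tour("B-tree", prereqs)
print(tour)
```

The goal always comes last because every other unit in the tour is one of its direct or indirect prerequisites. The relative order of independent units (here "tree" and "search key") is not determined by the relationship and is left to the sorter.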

Filtering Content Objects and Relationships As we have modeled the inner
structure of course units it is possible to filter content objects by type. For
instance, when reusing existing material, it is possible to ask for all Examples
or to filter by subtopic using the title attribute which is contained in the content
properties of any content object. Similarly, relationships can be filtered by type,
which makes it possible, e.g., to ask for relationships of type proves only.

Adaptable Navigation and Presentation Relying on the filtering
capabilities provided by the property attributes of all objects, we are able to adapt
the navigation through and the presentation of given contents and structures at
the very fine granularity of single content objects. As an example, students of
CS1 are shown exactly those Definitions which are appropriate to the knowledge
level of beginners, selected according to the attribute difficulty.
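
Filtering by type, by title, and by difficulty can be sketched in a few lines of Python. The record layout, the function name select, and the ordering low < medium < high are assumptions made for this illustration; the sample objects reuse the titles and difficulty values of the B-tree course unit shown in figure 4.

```python
from dataclasses import dataclass

# Assumed ordering of the difficulty values used in the sample model.
LEVELS = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class ContentObject:
    kind: str        # type of the content object, e.g. "definition"
    title: str       # content property
    difficulty: str  # pedagogical property


def select(objects, kind=None, title=None, max_difficulty=None):
    """Return the objects matching every criterion that is not None."""
    result = []
    for obj in objects:
        if kind is not None and obj.kind != kind:
            continue
        if title is not None and obj.title != title:
            continue
        if (max_difficulty is not None
                and LEVELS[obj.difficulty] > LEVELS[max_difficulty]):
            continue
        result.append(obj)
    return result


unit = [
    ContentObject("illustration", "Search Key", "medium"),
    ContentObject("definition", "B-Tree", "low"),
    ContentObject("definition", "B-Tree", "high"),
    ContentObject("illustration", "B-Tree (k=2)", "low"),
    ContentObject("proposition", "Space Efficiency", "medium"),
    ContentObject("proof", "Space Efficiency", "high"),
]

# Definitions suitable for beginners.
beginners = select(unit, kind="definition", max_difficulty="low")
print(beginners)
```

Because the course unit carries two definitions of the B-tree at different difficulty levels, the same unit can serve both a beginners' and an advanced audience without duplicating the remaining material.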



Convenient Navigation In our sample model, there is a special access
structure GuidedTour which supports sequential browsing of selected topics as
suggested by the author. There may, of course, be several guided tours, each of
them adapting course material to a particular audience and aiming at a certain
learning goal. Besides the navigational operations next and previous, authors
and learners can also use the conceptual relationship structure as an ancillary
structure, e.g. to navigate from a Proof via the proves relationship to the
corresponding Proposition.



Complex Query Capabilities The conceptual relationship structure mentioned
above can also be used by a teachware management system to pose queries
like “Which definitions and formal results are needed by undergraduate students
of computer science to understand B-trees from an application point of view?”



Integrating New Material A CourseUnit as an instance of «ConceptualUnit»
(see Figure 2) presents a self-contained sequence or set of content objects. The
author can create a new course unit directly by creating a new LM²L document
(cf. section 4). She also can reuse existing material, e.g. a MS Word document.
If this material has sectionings, an import wizard of a teachware management 
system could create course units using conventions such as ‘use sections of high- 
est sectioning depth as course units’. Furthermore, if there are specified objects, 
too, other conventions could be used, as well, e.g. ‘group a proposition and the 
proof which proves it on the same course unit’. Thus, the author has full control 
over the size of course units (she could even define the entire course material, e.g. 
a big MS Word document, to be one course unit) while getting support for fine 
granularity modularization. Finally, if there is no inner structure at all, existing 
material can be integrated using wrappers, i.e. integrating it as an atomic course 
unit. 



Combination and Sharing of Existing Material The use of metadata as 
described above allows authors to retrieve existing material according to various 
specifications and thus facilitates reuse and recombination. A teachware man- 
agement system can use appropriate attributes for source code control as well 
as version control. In addition, even 'similar' material can be found, e.g. with
similar pedagogical properties, provided that an appropriate metric is available. 
Analogously, metadata support sharing of material among several teachers. 



Supporting Designers of Models As the abstract meta-model has been im- 
plemented as a generic framework, it is easy for designers to reuse and extend 
existing domain-specific models or to create new ones. 

Suppose we want to integrate video clips into our course material on database 
theory, e.g. showing the teacher explaining an algorithm. Of course, multimedia 
objects like animations, video clips, or audio samples can be represented as 
content objects. All we have to do is to derive from the given abstract object 
Illustration a new object TeacherVideo. As an instance of Illustration it becomes 
a new possible source of the relationship illustrates. 
To extend our framework we sketch how to design a completely new model
for a different application domain. Suppose we want to describe learning material
on the world of opera: In the appropriate content model there are no elements
like Proposition, Proof, and so on. Rather we want to introduce concepts like
MusicScore or WorkDescription. Thus, we define a new model for music
material using these and other appropriate concepts but also reusing elements like
FloatingText from our sample model for computer science material.

4 Learning Material Markup Language 



<?xml version="1.0" standalone="no" ?>
<!DOCTYPE LearningMaterial (View Source for full doctype...)>
- <LearningMaterial>
  - <courseunit author="Prof. Dr. B. Freitag" date="99/04/10" discipline="computer science"
      language="english" title="B-tree" topics="B-tree, database index, search key, records,
      space, efficiency">
    + <illustration title="Search Key" topics="search key, records" difficulty="medium">
    + <definition title="B-Tree" topics="B-Tree" difficulty="low">
    + <definition title="B-Tree" topics="B-Tree" difficulty="high">
    + <illustration title="B-Tree (k=2)" topics="B-tree" difficulty="low">
    + <proposition title="Space Efficiency" topics="space, efficiency" difficulty="medium">
    + <proof title="Space Efficiency" topics="space, efficiency" difficulty="high">
    </courseunit>
  </LearningMaterial>



Fig. 4. Default view of B-tree.xml with collapsed content objects 



In this section, we describe some implementation issues that make use of the 
advantages of our meta-modeling approach presented in the previous section. 
First, we need a format, in which structured learning material not only is stored 
at learning material web sites, but also can be easily accessed by teachware 
management systems. 

As described in section 3.3, the conceptual units of our sample database
teaching material are presented as LM²L documents. LM²L stands for Learning
Material Markup Language, an XML application we have developed to mark up
formal teaching material in a Mathematics-oriented style.

The Extensible Markup Language (XML) [16] is a restricted form of SGML
retaining the power and flexibility of SGML while removing some complex fea- 
tures. XML and its family of technologies are standards managed by the World 
Wide Web Consortium (W3C) for representing information as structured docu- 
ments. 

The LM²L documents contain marked up sections, so called elements, that
represent conceptual content objects by syntactical means. For example, in
course material on database theory, content objects like Illustrations,
Definitions, Propositions, or Proofs describe e.g. B-trees in a Mathematics-oriented
style. Figure 4 shows a MS Internet Explorer 5 default view of a sample LM²L
source file with collapsed (and marked with a plus) elements representing content
objects.

Elements are declared in a Document Type Definition (DTD) [13], which
expresses the structure of a document and consists of tag definitions and rules
describing how elements may be nested, e.g.

<!-- ================ Document Structure ============================== -->

<!ELEMENT LearningMaterial (courseunit)>

<!ENTITY % specifiedobject
  "definition | proposition | theorem | algorithm | proof | illustration">
<!ELEMENT courseunit ( #PCDATA | %specifiedobject; | %floatingtext; )*>

To support adaptive presentation (see section 3.3) of course material and its
transformation into different document formats as needed, e.g. HTML for web
publishing or a print-oriented format for paper publishing, we use the
Extensible Style Language (XSL) [16].

XSL is based on Document Style Semantics and Specification Language 
(DSSSL) and its online version, DSSSL-O, and also uses some of the style el- 
ements of Cascading Style Sheets (CSS) [16]. It is simpler than DSSSL, while 
retaining much of its power. As opposed to CSS, XSL can not only render a 
document and add structure to it but can also be used to completely rearrange 
the input document structure. 
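The kind of restructuring described here can be imitated outside XSL. The following Python sketch is not an XSL style sheet and not part of our implementation; it only illustrates, under stated assumptions, how content objects might be selected for an audience by the difficulty attribute visible in Figure 4. The sample document, the function name select_for_audience, and the rule that a missing difficulty counts as "low" are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on Figure 4; not the actual B-tree.xml file.
SAMPLE = """
<LearningMaterial>
  <courseunit>
    <definition title="B-Tree" topics="B-Tree" difficulty="low">A B-tree is a multi level index.</definition>
    <definition title="B-Tree" topics="B-Tree" difficulty="high">A B-tree of height h ...</definition>
    <proposition title="Space Efficiency" topics="space, efficiency" difficulty="medium">...</proposition>
    <proof title="Space Efficiency" topics="space, efficiency" difficulty="high">...</proof>
  </courseunit>
</LearningMaterial>
"""

# Ordering of the three difficulty values declared in the DTD.
LEVELS = {"low": 0, "medium": 1, "high": 2}


def select_for_audience(xml_text, max_difficulty):
    """Return (tag, title) pairs of content objects not harder than max_difficulty."""
    root = ET.fromstring(xml_text)
    limit = LEVELS[max_difficulty]
    selected = []
    for unit in root.iter("courseunit"):
        for obj in unit:
            # Assumption: an object without a difficulty attribute is treated as "low".
            level = LEVELS[obj.get("difficulty", "low")]
            if level <= limit:
                selected.append((obj.tag, obj.get("title")))
    return selected


beginner_view = select_for_audience(SAMPLE, "low")
full_view = select_for_audience(SAMPLE, "high")
```

With this sample, the beginner view keeps only the low-difficulty definition, while the full view keeps all four content objects. An XSL style sheet would express the same selection declaratively and could additionally reorder or render the selected elements.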

The elements in our material are enriched by attributes realizing the meta- 
data proposed in section 3.3: 

<!ENTITY % genattrs
    "author      %Text;    #REQUIRED
     date        %Date;    #REQUIRED
     language    %Text;    #IMPLIED
     discipline  %Text;    #IMPLIED">

<!ENTITY % contattrs
    "title       %Text;    #REQUIRED
     topics      %Text;    #IMPLIED">

<!ENTITY % pedagattrs
    "difficulty  (low | medium | high)  #IMPLIED">

XSL style sheets use these metadata, e.g. describing the different levels of
difficulty (cf. figure 5), for restructuring and rendering the corresponding XML
course material for different audiences, layouts or platforms in web based
teaching providing adaptable presentation.

5 Related Work

We discuss the distinction between conceptual and navigational aspects on two
levels whereas e.g. [2, 9, 12] only use one modeling level. Work on hypermedia
design like [14] and [4] also discusses presentational aspects, the former using a
separate model for each aspect and the latter proposing a meta-modeling approach.

- <definition title="B-Tree" topics="B-Tree" difficulty="low">

    A B-tree is a multi level index.

  - <ul>

      <li>The file is organized as a balanced multipath tree with</li>
      <li>reorganisation capabilities.</li>

    </ul>

  </definition>

- <definition title="B-Tree" topics="B-Tree" difficulty="high">

    A B-tree of height h and fan out of 2k+1 is either empty or an
    ordered tree with the following properties:

  + <ol>

  </definition>

Fig. 5. Definitions of a B-tree with different levels of difficulty 



In contrast to our work, the conceptual model of these approaches describes the 
domain of application presented by the given hypermedia documents. 

The semantics of the real world is represented e.g. by entities and relation- 
ships in an extended ER notation [4] or by conceptual classes, relationships, 
attributes, etc. in an object oriented modeling approach [14]. 

[4] uses the Telos language [10] for expressing a meta-model for hypermedia
design. Like UML, it offers a well-defined declarative semantics. Modeling
tools also exist for a subset of Telos.

There are a number of (implemented) formalisms that are designed for con- 
ceptual modeling of the “meaning” of textual documents, e.g. Narrative Knowl- 
edge Representation Language (NKRL), Conceptual Graphs, Semantic Networks 
(SNePS), CYC, LILOG and Episodic Logic (for an overview, see [19]). As men- 
tioned earlier, in our approach we refrain from representing natural language 
semantics but choose to model the structure of educational material. 

[11] proposes knowledge items comparable to CourseUnits, but disregards
their inner structure.

Concerning the use of metadata, nodes or node types in Computer Based
Instruction (CBI) frameworks like [7] are described by a fixed set of attributes,
e.g. classification (glossary node, question node), control over access (system,
coactive, proactive), display orientation (frame, window), and presentation
(static, dynamic, interactive). In our approach, the classification attributes
could be instances of «ConceptualUnit», whereas the latter sets of attributes
would belong to the corresponding properties. The types of links are also fixed:
contextual, referential, detour, annotational, return, and terminal. Using a
freely definable domain-specific model, we are also not tied to a fixed number of
structure nodes like structure, sequence, or exploration as in [3].

To organize metadata in the sample domain-specific model, we restricted 
ourselves to vectors of attribute/value pairs. In general, our abstract framework
allows the use of arbitrary specifications of metadata, e.g. the use of labeled 
directed acyclic graphs which constitute the Resource Description Framework 
(RDF) model [17]. 

In the IMS approach [6] attributes similar to those used in our sample model 
are proposed which describe all kinds of educational resources in general and as 
a whole. Beyond that, in our approach, we propose domain specific metadata
describing small parts of conceptual units, e.g. Definitions, Proofs, etc. 

The XML based representation of teaching material developed in the TAR- 
GETEAM project [15] provides only a small fixed set of domain-independent 
content objects without attributes. 



6 Conclusion and Future Work 

A meta-modeling approach to adaptive hypermedia-based electronic teaching
material has been presented which allows knowledge about aspects of the
contents of course material, as well as navigational aspects, to be described.
As an instance of the generic abstract meta-model we have described a sample
computer science specific model for course material on database theory. Our
approach combines some of the advantages of existing proposals and supports
authors in adapting existing material to new audiences as well as learners in
adapting content and navigation to their needs. As a foundation for future
implementation we introduced the Learning Material Markup Language (LM²L) to
syntactically represent (formal) course material.

The techniques described in this paper are not only applicable to learning
material but could also be applied to completely different types of knowledge 
material on the web. 

Future work will consider the modeling of other application domains as
well as the corresponding generalizations of LM²L needed to make it suitable
for marking up a wider variety of course material by using modularisation. In
doing so, we are also integrating SMIL [18] to describe multimedia contents.

Finally, a web based teachware management system is about to be imple- 
mented which comprises the meta-model introduced in this paper as well as 
methods to generate and verify the models governed by it. It will also provide 
tools to support the creation and management of teachware, especially the com- 
position and configuration to new audiences and learning goals. 
Abstract. Data warehousing and electronic commerce are two of the
most rapidly expanding fields in recent information technologies. In this
paper, we discuss the design of data warehouses for e-commerce environ-
ments. We discuss requirement analysis, logical design, and aggregation
in e-commerce environments. We have collected an extensive set of inter- 
esting OLAP queries for e-commerce environments, and classified them 
into categories. Based on these OLAP queries, we illustrate our design 
with data warehouse bus architecture, dimension table structures, a base 
star schema, and an aggregation star schema. We finally present various 
physical design considerations for implementing the dimensional models. 
We believe that our collection of OLAP queries and dimensional models 
would be very useful in developing any real-world data warehouses in 
e-commerce environments. 



1 Introduction 

In this paper, we discuss the design of data warehouses for the electronic com- 
merce (e-commerce) environment. Data warehousing and e-commerce are two of 
the most rapidly expanding fields in recent information technologies. Forrester 
Research estimates that e-commerce business in the US could reach $327
billion by 2002 [7], and International Data Corp. estimates that e-commerce 
business could exceed $400 billion [9]. Business analysis of e-commerce will be- 
come a compelling trend for competitive advantage. A data warehouse is an 
integrated data repository containing historical data of a corporation for sup- 
porting decision-making processes. A data warehouse provides a basis for online 
analytic processing and data mining for improving business intelligence by turn- 
ing data into information and knowledge. Since technologies for e-commerce 
are being rapidly developed and e-businesses are rapidly expanding, analyzing 
e-business environments using data warehousing technology could significantly 
enhance business intelligence. A well-designed data warehouse feeds businesses 
with the right information at the right time in order to make the right decisions 
in e-commerce environments. 

The addition of e-commerce to the data warehouse brings both complexity 
and innovation to the project. E-commerce is already acting to unify standalone 
transaction processing systems. These transaction systems, such as sales systems, 
marketing systems, inventory systems, and shipment systems, all need to be
accessible to each other for an e-commerce business to function smoothly over 
the Internet. In addition to those typical business aspects, e-business now needs 
to analyze additional factors unique to the Web environment. For example, it 
is known that there is significant correlation between Web site design and sales 
and customer retention in an e-commerce environment [15]. Other unique issues 
include capturing the navigation habits of its customers [12], customizing the 
Web site design or Web pages, and contrasting the e-commerce side of a business 
against catalog sales or actual store sales. Data warehousing could be utilized 
for all of these e-commerce-specific issues. Another concern in designing a data 
warehouse in the e-commerce environment is when and how we capture the 
data. Many interesting pieces of data could be automatically captured during 
the navigation of Web sites. 

In this paper, we present requirement analysis, logical design, and aggregation 
issues in building a data warehouse for e-commerce environments. We have col- 
lected an extensive set of interesting OLAP (online analytic processing) queries 
for e-commerce environments, and classified them into categories. Based on these 
OLAP queries and Kimball’s methodology for designing data warehouses [13], 
we illustrate our design with a data warehouse bus architecture, dimension table 
detail diagrams, a base star schema, and an example of aggregation schema. To 
our knowledge, there has been no detailed and explicit dimensional model on 
e-commerce environments shown in the literature. Kimball discusses the idea of 
a Clickstream data mart and a simplified star schema in [12]. However, it shows 
only the schematic structure of the star schema. Buchner and Mulvenna discuss 
the use of star schemas for market research [4]. They also show only a schematic 
structure of a star schema. In this paper we show the details of a dimensional 
model for e-commerce environments. We do not claim that our model could be 
universally used for all e-commerce businesses. However, we believe that our col- 
lection of OLAP queries and dimensional models could provide a framework for 
developing a real-world data warehouse in e-commerce environments. 

The remainder of this paper is organized as follows: Section 2 discusses our
data warehouse design methodology. Section 3 presents requirement analysis as- 
pects and interesting OLAP queries in e-commerce environments. Section 4 cov- 
ers logical design and development of dimensional models for e-commerce. Sec- 
tion 5 discusses aggregation of the logical dimension models. Section 6 concludes 
our paper and discusses further research issues in designing data warehouses for 
e-commerce. 

2 Our Data Warehouse Design Methodology 

The objective of a data warehouse design is to create a schema that is optimized
for decision support processing. OLTP systems are typically designed by
developing entity-relationship diagrams (ERD). Some research shows how to
represent a data warehouse schema using an ER-like model [3, 8, 11, 17] or use
the ER model to verify the star schema [14]. The data schema for a data warehouse
must be simple to understand for a business analyst. The data in a data warehouse
must be clean, consistent, and accurate. The data schema should also support fast
query processing. The dimensional model, also known as star schema, satisfies
the above requirements [2, 10, 13]. Therefore, we focus on creating a dimensional
model that represents data warehousing requirements.

Fig. 1. Our Methodology

Data warehousing design methodologies continue to evolve as data warehous- 
ing technologies are evolving. We still do not have a thorough scientific analysis 
on what makes data warehousing projects fail and what makes them successful. 
According to a study by the Gartner group, the failure rate for data warehousing 
projects runs as high as 60%. Our extensive survey shows that one of the main
means for reducing the risk is to adopt an incremental developmental methodol- 
ogy as in [1, 13, 16]. These methodologies allow you to build the data warehouse 
based on an architecture. In this paper, we adopt the data warehousing design 
methodology suggested by Kimball and others [13]. The methodology to build a 
dimensional model consists of four steps: 1) choose data mart, 2) choose granu- 
larity of fact table, 3) choose dimensions appropriate for the granularity, and 4) 
choose facts. 

Our modified design methodology based on [13] can be summarized in Fig- 
ure 1. We have collected an extensive set of OLAP queries for requirement anal- 
ysis. We used the OLAP queries as a basis for the design of the dimension model. 

3 Requirements Analysis 

In this section, we present our approach for requirement analysis based on an 
extensive collection of OLAP queries. 



3.1 Requirements 

Requirements definition is an unquestionably important part of designing a data
warehouse, especially in an electronic commerce business. The data warehouse
team usually assumes that all the data they need already exist within multiple
transaction systems. In most cases this assumption will be correct. In an
e-commerce environment, however, this may not always be true since we also need
to capture data unique to e-commerce.

The major problems of designing a data warehouse for e-commerce environ- 
ments are: 

— handling of multimedia and semi-structured data, 

— translation of a paper catalog into a Web database,

— supporting user interface at the database level (e.g., navigation, store layout,
hyperlinks),

— schema evolution (e.g. merging two catalogs, category of products, sold-out 
products, new products), 

— data evolution (e.g. changes in specification and description, naming, prices), 

— handling meta data, and 

— capturing navigation data within the context [12]. 

In this paper, we focus on data available from transactional systems and 
navigation data. 

3.2 OLAP Queries for E-Commerce 

Our approach for requirements analysis for data warehousing in e-commerce 
environments was to go through a series of brainstorming sessions to develop 
OLAP queries. We visited actual e-commerce sites to get experience, simulated 
many business scenarios, and developed these OLAP queries. After capturing 
the business questions and OLAP queries, we categorized them. This classifi- 
cation was accomplished by consulting business experts to determine business 
processes. Some of the processes determined by the warehouse designers in co- 
operation with business experts fell into the following seven categories: Sales & 
Market Analysis, Returns, Web Site Design & Navigation Analysis, Customer 
Service, Warehouse/Inventory, Promotions, and Shipping. Most often, each of these
categories becomes a subject area around which a data mart is designed. In
this paper we show some OLAP queries of only two categories: Sales & Market 
Analysis and Web Site Design & Navigation Analysis. 

Sales & Market Analysis: 

— What is the purchase history/pattern for repeated users? 

— What type of customer spends the most money? 

— What type of payment options is most common (by size of purchase and by 
socio-economic level)? 

— What is the demand for the Top 5 items based on the time of year and 
location? 

— List sales by product groups, ordered by IP address. 

— Compared to the same month last year, what are the lowest 10% items sold? 

— Of multiple product orders, is there any correlation between the purchases 
of any products? 

— Establish a profile of what products are bought by what type of clients. 

— How many different vendors are typically in the customer’s market basket? 

— How much does a particular vendor attract one socio-economic group? 

— Since our last price schedule adjustment, which product sales have improved 
and which have deteriorated? 

— Do repeat customers make similar product purchases (within general product 
category) or is there variation in the purchasing each time? 
— What types of products do repeat customers purchase most often?

— What are the top 5 most profitable products by product category and de- 
mographic location? 

— Which products are also purchased when one of the top 5 selling items is 
purchased? 

— What products have not been sold online in the last X days?

— What day of the week do we do the most business by each product category? 

— What items are requested but not available? How often and why? 

— What is the best sales month for each product? 

— What is the average number of products per customer order purchased from 
the Web site? 

— What is the average order total for orders purchased from the Web site? 

— How well do new items sell in their first month? 

— What season is the worst for each product category? 

— What percent of first-time visitors actually make a purchase?

— What products attract the most return business? 

— Based on history and known product plans, what are realistic, achievable 
targets for each product, time period and sales channel? 

— Have some new products failed to achieve sales goals? Should they be with- 
drawn from online catalog? 

— Are we on target to achieve the month-end, quarter-end or year-end sales 
goals by product or by region? 

Web Site Design & Navigation Analysis:

— At what time of day does the peak traffic occur? 

— At what time of day does the most purchase traffic occur? 

— Which types of navigation patterns result in the most sales? 

— How often do purchasers look at detailed product information by vendor 
types? 

— What are the ten most visited pages? (By day, weekend, month, season) 

— How much time is spent on pages with and without banners? 

— How does a non-purchase correlate to Web site navigation? 

— Which vendors have the most hits? 

— How often are comparisons asked for? 

— Based on page hits during a navigation path, what products are inquired 
about most but seldom purchased during a visit to the Web site? 
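To make the link between such business questions and a dimensional model concrete, the following sketch translates one of the Sales & Market Analysis queries ("What day of the week do we do the most business by each product category?") into SQL. It is only an illustration under stated assumptions: the table and column names (date_dim, product_dim, sales_fact, line_item_total) and the sample rows are hypothetical and are not taken from our schema, and SQLite is used merely because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, heavily reduced star schema: two dimensions and one fact table.
CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, day_of_week TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sales_fact  (date_key INTEGER, product_key INTEGER, line_item_total REAL);

INSERT INTO date_dim VALUES (1, 'Monday'), (2, 'Saturday');
INSERT INTO product_dim VALUES (10, 'Books'), (20, 'Music');
INSERT INTO sales_fact VALUES
  (1, 10, 30.0), (2, 10, 75.0), (2, 10, 20.0),
  (1, 20, 50.0), (2, 20, 15.0);
""")

# Revenue per (category, day of week), then the best day within each category.
best_day_by_category = conn.execute("""
WITH totals AS (
  SELECT p.category, d.day_of_week, SUM(f.line_item_total) AS revenue
  FROM sales_fact f
  JOIN date_dim d    ON d.date_key = f.date_key
  JOIN product_dim p ON p.product_key = f.product_key
  GROUP BY p.category, d.day_of_week
)
SELECT t.category, t.day_of_week, t.revenue
FROM totals t
WHERE t.revenue = (SELECT MAX(revenue) FROM totals WHERE category = t.category)
ORDER BY t.category
""").fetchall()
```

The nouns in the question (day of the week, product category) map to dimension attributes, and "business" maps to an additive fact that is summed; this is the reading of the queries that the logical design in the next section relies on.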



4 Logical Design 

In this section, we present the various diagrams that capture the logical detail 
of data warehouses for e-commerce environments. 
Fig. 2. Data Warehouse Bus Architecture for E-Commerce



4.1 Data Warehouse Bus Architecture 

From analyzing the above OLAP categories, the data warehouse team can now
design an overview of the electronic commerce business, called the data ware-
house bus architecture, as shown in Figure 2. The data warehouse bus architec-
ture is a matrix that shows dimensions as columns and data marts as rows. The
matrix shows which dimensions are used in which data marts. The architecture 
was first proposed by Kimball [13] in order to standardize the development of 
data marts into an enterprise-wide data warehouse, rather than causing stove- 
pipe data marts that were unable to be integrated into a whole. By designing the 
warehouse bus architecture, the design team determines before building any of 
the dimensions which dimensions must conform across multiple subject areas of 
the business. The dimensions used in multiple data marts are called conformed 
dimensions. Even when designing a single data mart, the team still needs to
examine other data mart areas to create conformed dimensions. This will ease the
integration of multiple data marts into an integrated data warehouse later.
Taking time to develop conformed dimensions lessens the impact of changes later
in the development life cycle while the enterprise-wide data warehouse is being
built.
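The bus matrix lends itself to a very small data structure. The sketch below is an illustration only: the data marts and dimensions listed are a hypothetical excerpt and do not reproduce Figure 2, and the function name conformed_dimensions is our own. It represents each data mart (row) by the set of dimensions (columns) it uses and reports the dimensions shared by more than one data mart, i.e. the candidates that must conform.

```python
from collections import Counter

# Hypothetical bus matrix: data mart (row) -> dimensions it uses (columns).
bus_matrix = {
    "Sales": {"Date", "Customer", "Product", "Promotion", "Web Site"},
    "Returns": {"Date", "Customer", "Product"},
    "Navigation": {"Date", "Customer", "Web Site"},
    "Shipping": {"Date", "Customer", "Product", "Shipper"},
}


def conformed_dimensions(matrix):
    """Return, in alphabetical order, the dimensions used by more than one data mart."""
    usage = Counter(dim for dims in matrix.values() for dim in dims)
    return sorted(dim for dim, count in usage.items() if count > 1)


conformed = conformed_dimensions(bus_matrix)
```

In this excerpt, Date, Customer, Product, and Web Site are shared and would have to be designed once for all data marts, whereas Promotion and Shipper are local to a single data mart.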

The e-commerce business has several significant differences from regular busi- 
nesses. In the case of e-commerce, you do not have a sales force to keep track of, 
just customer service representatives. You cannot influence your sales through
sales force incentives in e-commerce; therefore, the business must find other ways
to influence its sales. This means that e-commerce businesses must pay more at- 
tention not only to proper interface design, but also to their promotions and 
advertisements to determine their effect upon the business. This includes track- 
ing coupons, letter mail lists, e-mail mailing lists, banner ads, and ads within 
the main Web site. 
The e-commerce business also needs to keep track of clickstream activity,
something that the average business need not worry about. Clickstream analysis 
can be compared to physical store analysis, wherein the business analysts de- 
termine which items in which locations sell better as compared to other similar 
items in different locations. An example of this is analyzing sales from endcaps 
against sales from along the paths between shelves. There is similarity between 
navigating a physical store and navigating a Web site in that the e-commerce 
business must maximize the layout of its Web site to provide friendly, easy nav- 
igation and yet guide consumers to its more profitable items. Tracking this is 
more difficult than in a traditional store, since there are so many more possible 
combinations of navigation patterns throughout a Web site. 

Furthermore, the very nature of e-commerce makes gathering demographic 
information on customers somewhat complicated. The only information cus- 
tomers provide is name and address, and perhaps credit card information. For 
statistical analysis, DW developers would like demographic information, such as 
gender, age, household income, etc., in order to be able to correlate it with sales 
information to show which sorts of customers are buying which types of items. 
Such demographic information may be available elsewhere, but gathering it di- 
rectly from customers may not be possible in e-commerce. The data warehouse 
designers must be aware of this potential complication and be able to design for 
it in the data warehouse. 

4.2 Dimension Models 

Granularity of the Fact Table. The focus of the data warehouse design 
process is on a subject area. With this in mind, the designers return to analyzing 
the OLAP queries to determine the granularity of the fact table. Granularity 
determines the lowest, most atomic data that the data warehouse will capture. 
Granularity is usually discussed in terms of time, such as a daily granule or a 
monthly granule. The lower the level of granularity, the more robust the design 
is, since a lower granularity can handle unexpected queries and the later addition
of new data elements [13]. For a sales subject area, we want to capture
daily line item transactions. The designers have determined from the OLAP 
queries collected during the course of the requirement definition that the business 
analysts want a fine level of detail to their information. For example, one OLAP 
query is “What type of products do repeat customers most often purchase?” In 
order to answer that question the data warehouse must provide access to all of 
the line items on all sales, because that is the level that associates a specific 
product with a specific customer. 
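The role of the line item grain can be shown with a small sketch. The schema below is hypothetical (the names sales_fact, product_dim, customer_key, and product_type are assumptions made for this illustration), but it keeps one fact row per order line item, which is what allows the repeat-customer query quoted above to be answered at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_type TEXT);
-- One row per order line item: the grain chosen in the text.
CREATE TABLE sales_fact (
  order_id INTEGER, customer_key INTEGER, product_key INTEGER, quantity INTEGER
);
INSERT INTO product_dim VALUES (1, 'Book'), (2, 'CD'), (3, 'DVD');
INSERT INTO sales_fact VALUES
  (100, 7, 1, 1), (100, 7, 2, 2),   -- customer 7, first order
  (101, 7, 1, 3),                   -- customer 7, second order (repeat customer)
  (102, 8, 3, 1);                   -- customer 8, single order
""")

# Units bought per product type, restricted to customers with more than one order.
repeat_purchases = conn.execute("""
SELECT p.product_type, SUM(f.quantity) AS units
FROM sales_fact f
JOIN product_dim p ON p.product_key = f.product_key
WHERE f.customer_key IN (
  SELECT customer_key FROM sales_fact
  GROUP BY customer_key HAVING COUNT(DISTINCT order_id) > 1
)
GROUP BY p.product_type
ORDER BY units DESC
""").fetchall()
```

Had the fact table been stored at the order level (one row per order), the association between a specific product and a specific customer would have been lost, and the query could not be expressed.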

Dimension Table Detail Diagrams. After determining the fact table gran- 
ularity, the designers then proceed to determine the dimensional attributes. Di- 
mension attributes are those that are used to describe, constrain, browse, and 
group fact data. The dimensions themselves were already determined when the 
data warehouse bus architecture was designed. Once again, the warehouse de- 
signers return to analyzing the OLAP queries in order to determine important 
attributes for each of the dimensions. The warehouse team starts documenting
the attributes they find throughout the OLAP queries in dimension table detail
diagrams, as shown in the Product Dimension detail diagram in Figure 3.¹ The
diagram models individual attributes within a single dimension. The diagram 
also models various hierarchies among attributes and properties, such as slowly 
changing attributes and cardinality. The cardinality is shown on the top of each 
box inside the parentheses. 

We look closely at the nouns in each of the OLAP queries; these nouns are the
clues to the dimension’s attributes. Consider, for example, the query “List sales
by product groups, ordered by IP address.” We ignore sales, since that refers to a
fact to be analyzed. We already have product as one of our dimensions, so product
group would be one of its attributes. Notice, however, that the diagram contains
no product group attribute. One analyst uses product groups in his queries, while
another uses subcategories. Both have the same semantics. Hence, subcategory
replaced product group. One lesson to be remembered is that the design should not
be finalized
using the first set of attributes that are derived. Designing the data warehouse 
is an iterative process. The designers present their findings to the business with 
the use of dimension table detail diagrams. In most cases the diagrams will be 
revised in order to maintain consistency throughout the business. 

Returning to our example query, the next attribute found is the IP (Internet
Protocol) address. This is a new attribute to be captured in the data warehouse,
since a traditional business does not need to capture IP addresses. The next
question is “which dimension do we put it in?”

¹ Other dimension table detail diagrams are omitted here due to lack of space (see
[18] for more queries and diagrams), but they are summarized in Figure 4.

The customer dimension is where the ISP (Internet Service Provider) is cap- 
tured. There is some argument around placing it directly within the customer 
dimension. In some cases, depending upon where the customer information is 
derived from, the designers may want to create a separate dimension that col- 
lects e-mail addresses, IP addresses, and ISPs. Design arguments such as these 
are where the data warehouse design team usually starts discussing possible sce- 
narios. There is a compelling argument for creating another dimension devoted 
to on-line issues such as e-mail addresses, IP addresses, and ISPs. Many indi- 
viduals surf to e-commerce Web sites to look around. Of these individuals, some 
will register and buy, some will register but never buy, and some individuals will 
never register and never buy. An e-commerce business needs to track individuals 
that become customers (those that buy) in order to generate a more complete 
customer profile. A business also needs to track those that do not buy, for whom 
the business may have incomplete or sometimes completely false information. 

Lastly, one of the hardest decisions is where to capture Web site design and 
navigation. Does it belong in a subject area of its own or is it connected to the 
sales subject area? Kimball has presented a star schema for a clickstream data 
mart [12]. While this star schema will capture individual clicks (following a link) 
throughout the Web site, it does not directly reflect which of those clicks result 
in a sale. The decision to include a Web site dimension within the sales star 
schema must be made in conjunction with business experts, who will ultimately 
be called upon to analyze the information. Many of the queries collected from 
brainstorming sessions centered around Web site navigation, as well as the design 
elements of the actual Web pages. For this reason we created two dimensions, a 
Web site dimension and a navigation dimension. The Web site dimension pro- 
vides the design information about the Web page the customer is purchasing a 
product from. Having information about Web page designs that result in suc- 
cessful sales is another tangible benefit of including a Web site dimension in the 
data warehouse. Much research has been invested in determining how best to 
sell products through catalogs. That information may not translate into selling 
products in an interactive medium such as the internet. Businesses need more 
than access to data about which Web pages are selling products. They need 
to know the specifics about that page, such as how many products appear on 
that page and whether there are just pictures on the Web page or just a text 
description or both. Much of the interest has focused on navigation patterns and
Web site interface design, sometimes to the exclusion of the actual design of the 
product pages. The business needs access to both kinds of information, which 
our design provides. 

These are just a few examples of dimension detail structure that was de- 
vised for our e-commerce star schema. After several iterations of determining 
attributes, assigning them to dimensions, and then gaining approval from busi- 
ness experts, dimension design stabilizes. 
Fact Table Detail Diagram. Next, the fact table’s attributes must be de-
termined. All the attributes of a fact table are captured in a fact table detail
diagram (not shown here for lack of space; see [18]). The diagram includes
a complete list of all facts, including derived facts. Basic captured facts include 
line item product price, line item quantity, and line item discount amount. Some 
facts can be derived once the basic facts are known, such as line item tax, line 
item shipping, line item product total, and line item total amount. Other non- 
additive facts may be usefully stored in the fact table, such as average line item 
price and average line item discount. However, these derived facts may be more 
usefully calculated at an aggregate level, such as rolled up to the order level 
instead of the order line item level. Once again fact attributes will need to be 
approved by the business experts before proceeding. 
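The difference between additive basic facts and non-additive derived facts can be illustrated with a short worked example. The sketch below makes several assumptions that are not part of our fact table detail diagram: the class name LineItem, the sample values, and in particular the derivation rule for the line item product total (price times quantity minus discount) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LineItem:
    order_id: int
    price: float     # line item product price (basic fact)
    quantity: int    # line item quantity (basic fact)
    discount: float  # line item discount amount (basic fact)

    @property
    def product_total(self):
        # Assumed derivation rule: price * quantity - discount.
        return self.price * self.quantity - self.discount


# Hypothetical sample: order 1 has two line items, order 2 has one.
items = [
    LineItem(1, 10.0, 1, 0.0),
    LineItem(1, 20.0, 1, 0.0),
    LineItem(2, 30.0, 1, 0.0),
]


def avg_price(rows):
    """Average line item price over the given rows."""
    return sum(r.price for r in rows) / len(rows)


# Additive fact: totals can simply be summed across any set of rows.
grand_total = sum(r.product_total for r in items)

# Non-additive fact: an average stored per order cannot be re-averaged safely.
per_order = {oid: avg_price([r for r in items if r.order_id == oid]) for oid in (1, 2)}
avg_of_avgs = sum(per_order.values()) / len(per_order)
true_avg = avg_price(items)
```

Here the per-order averages are 15.0 and 30.0, and averaging them yields 22.5, whereas the average over all three line items is 20.0. This is one reason why such facts may be better computed at the aggregate level at which they are needed than stored per line item.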



A Complete Star Schema for E-Commerce. Finally, all the pieces have 
been revealed. Through analyzing the OLAP queries we started with, we have 
determined the necessary dimensions, the granularity of the fact table at the 
daily line item transaction level, the attributes that belong in each dimension, 
and the measurable facts that the business wants to analyze. These pieces are 
now joined into the logical dimensional diagram for the sales star schema for e- 
commerce in Figure 4. As the star schema shows, the warehouse designers have 
provided the business users with a wide variety of views into their e-commerce 
sales. Also, by having already collected sample OLAP queries, the data ware- 
house can be tested against realistic scenarios. The designers pick a query out 
of the list and test to see if the information necessary to answer the question 
is available in the data warehouse. For example, how much of our total sales 
occur when a customer is making a purchase after arriving from a Web site 
containing an ad banner? In order to satisfy this query, information needs to be 
drawn from the navigation and the advertising dimensions, and possibly from 
the customer dimension as well if the analyst wants further groupings for the 
information. The data warehouse design team will conduct many scenario-based 
walkthroughs of the data warehouse in conjunction with the business experts. 
Once everyone is satisfied that the design of the data warehouse will satisfy the 
needs of the business users, the next step is to complete the physical design. 
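
A scenario-based walkthrough of the ad banner query could be sketched as follows. This is only an illustration: the table names, column names (`arrival_source`, `ad_type`, `line_item_total_amount`), and sample rows are hypothetical and are not taken from Figure 4. The point is that the query can be answered by joining the fact table to the navigation and advertising dimensions.

```python
import sqlite3


def build_sample_warehouse():
    """Create a tiny in-memory star schema with hypothetical names and data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE navigation_dim (
            navigation_key INTEGER PRIMARY KEY, arrival_source TEXT);
        CREATE TABLE advertising_dim (
            advertising_key INTEGER PRIMARY KEY, ad_type TEXT);
        CREATE TABLE sales_fact (
            navigation_key INTEGER,
            advertising_key INTEGER,
            line_item_total_amount REAL);
        INSERT INTO navigation_dim VALUES (1, 'external site'), (2, 'direct');
        INSERT INTO advertising_dim VALUES (1, 'banner'), (2, 'none');
        INSERT INTO sales_fact VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 2, 50.0);
    """)
    return conn


def banner_sales_share(conn):
    """Fraction of total sales made after arriving via a banner ad."""
    banner_sales, total_sales = conn.execute("""
        SELECT SUM(CASE WHEN n.arrival_source = 'external site'
                         AND a.ad_type = 'banner'
                        THEN f.line_item_total_amount ELSE 0 END),
               SUM(f.line_item_total_amount)
        FROM sales_fact AS f
        JOIN navigation_dim AS n ON n.navigation_key = f.navigation_key
        JOIN advertising_dim AS a ON a.advertising_key = f.advertising_key
    """).fetchone()
    return banner_sales / total_sales


share = banner_sales_share(build_sample_warehouse())
```

If the analyst wants further groupings, a join to a customer dimension and a GROUP BY clause on one of its attributes would be added in the same way.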

5 Aggregation 

In this section, we discuss the aggregation of the fact table and show an example 
aggregation schema. Aggregation pre-computes summary data from the base
table. Each aggregation is stored as a separate table. Aggregation is perhaps
the single area of data warehousing with the largest technology gap between
the research community and commercial systems [6]. The research literature
abounds with papers on materialized views (MVs). The three main issues for
MVs are selecting an optimal set of MVs, maintaining those MVs automatically
and incrementally, and optimizing queries using those MVs. While research on
MVs has focused on automating these three tasks, commercial practice 

Fig. 4. Base Star Schema for E-Commerce Sales 



has been to manually identify aggregations and maintain them using metadata 
in a batch mode. Commercial systems have only recently begun to implement 
rudimentary techniques supporting materialized views [5]. 

Aggregation is probably the single most powerful feature for improving
performance in data warehouses. The existence of aggregates can speed up
queries by a factor of 100 or even 1,000 [13]. Because of this dramatic
effect on performance, building aggregates should be considered a part of
performance-tuning the warehouse. The existing aggregation schemes should
therefore be reevaluated periodically as business requirements change. We note
that the benefits of aggregation come with the overhead of additional storage
and maintenance. 

Two important concerns in determining aggregation schemes are common
business requests and the statistical distribution of the data. We prefer
aggregation schemas that answer the most common business requests and that
reduce the number of rows to be processed. 

Deciding which dimensions and attributes are candidates for aggregation
sends the warehouse designers back to their collection of OLAP queries and
their prioritization. Using the OLAP queries, the designers can determine
the most common business requests. In many cases the business will have some
preset reporting needs, such as time periods, major products, or specific
customer demographics that it routinely reports on. Those areas that will be
frequently accessed or routinely reported become candidates for aggregation. 

The second determination is made by examining the statistical distribution
of the data. For that information, the warehouse team returns to the dimension
detail diagrams, which show the potential cardinality of each major attribute.
This is where the potential benefits of the aggregation are weighed against the
impact of the additional storage and processing overhead. The object of this
exercise is to evaluate each of the dimensions that could be included in the
query, together with the hierarchies (or groupings) within those dimensions,
and to list the probable cardinalities for each level in the hierarchy. Next,
to assess the potential record count, multiply together the highest-cardinality
attributes in each of the dimensions to be queried. Then determine the sparsity
of the data. If the query involves product and time, determine whether the
business sells every individual product every single day (no sparsity). If not,
determine what percentage of products is sold each day (% sparsity). If there
is some sparsity in the data, multiply the preceding record count by that
percentage. Then, to evaluate the benefit of the aggregation, choose one
hierarchy level in each of the dimensions associated with the query. Multiply
the cardinalities of those higher-level attributes and divide the product into
the row count obtained from the highest-cardinality attributes. The quotient
approximates the factor by which the aggregate reduces the number of rows to
be processed. 
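
The estimate described above can be worked through with a small sketch. All of the numbers are hypothetical (10,000 products, 365 days, 10% of products sold on a given day, 500 brands, 12 months), and the function names are ours. The aggregate is assumed to be dense, so the computed ratio is a conservative estimate of the benefit.

```python
from math import prod


def estimate_rows(cardinalities, fraction_present=1.0):
    """Estimated row count: product of cardinalities, scaled by the
    fraction of combinations that actually occur (the "% sparsity")."""
    return round(prod(cardinalities) * fraction_present)


def aggregation_benefit(base_cardinalities, fraction_present, aggregate_cardinalities):
    """Return (base rows, aggregate rows, base rows per aggregate row)."""
    base_rows = estimate_rows(base_cardinalities, fraction_present)
    # The aggregate is assumed dense (every brand sells in every month).
    aggregate_rows = estimate_rows(aggregate_cardinalities)
    return base_rows, aggregate_rows, base_rows / aggregate_rows


# Hypothetical example: product by day rolled up to brand by month.
base_rows, aggregate_rows, ratio = aggregation_benefit(
    base_cardinalities=[10_000, 365],
    fraction_present=0.10,
    aggregate_cardinalities=[500, 12],
)
```

With these assumed figures, a query answered from the aggregate touches roughly one sixtieth of the rows it would scan in the base fact table, which must then be weighed against the cost of storing and maintaining 6,000 additional rows.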

The example aggregation schema is shown in Figure 5. Each aggregation 
schema is connected with the base star schema. The dimensions attached in the 
base schema are called base dimensions. In the aggregation schema, dimensions 
containing a subset of the original base dimension are called shrunken dimensions 
[13]. Here only three dimensions are impacted and become shrunken dimensions. 
This aggregation still provides much useful data, in that it can be used for the
analysis of any higher-level attribute in the hierarchy of the Brand dimension or
the Month dimension. For example, the aggregation can be used for analysis
along the Brand dimension (Subcategory by Month, Category by Month, and
Department by Month) and along the Month dimension (Brand by Quarter and
Brand by Year). This example shows the inherent power of using aggregates. 
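
How a single Brand by Month aggregate can serve higher-level queries is sketched below. The brand names, hierarchy values, and amounts are invented for illustration, and the dictionaries stand in for the shrunken Brand and Month dimensions; this is not the schema of Figure 5.

```python
from collections import defaultdict

# Hypothetical shrunken dimensions: each brand or month maps to its
# higher-level hierarchy attributes.
BRAND_DIM = {
    "Acme": {"subcategory": "Running Shoes", "category": "Footwear",
             "department": "Apparel"},
    "Zenith": {"subcategory": "Sandals", "category": "Footwear",
               "department": "Apparel"},
}
MONTH_DIM = {
    "1999-01": {"quarter": "1999-Q1", "year": "1999"},
    "1999-02": {"quarter": "1999-Q1", "year": "1999"},
    "1999-04": {"quarter": "1999-Q2", "year": "1999"},
}

# Hypothetical aggregate fact rows: (brand, month, sales amount).
BRAND_BY_MONTH = [
    ("Acme", "1999-01", 100.0),
    ("Acme", "1999-02", 50.0),
    ("Zenith", "1999-01", 25.0),
    ("Zenith", "1999-04", 75.0),
]


def roll_up(rows, brand_level=None, month_level=None):
    """Re-aggregate Brand by Month rows to a higher hierarchy level.

    A level of None keeps that dimension at its stored grain.
    """
    totals = defaultdict(float)
    for brand, month, amount in rows:
        brand_key = brand if brand_level is None else BRAND_DIM[brand][brand_level]
        month_key = month if month_level is None else MONTH_DIM[month][month_level]
        totals[(brand_key, month_key)] += amount
    return dict(totals)


category_by_month = roll_up(BRAND_BY_MONTH, brand_level="category")
brand_by_quarter = roll_up(BRAND_BY_MONTH, month_level="quarter")
```

Neither roll-up touches the base fact table; both are answered entirely from the much smaller aggregate, provided the measure being summed is additive.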

6 Summary and Discussion 

In this paper, we have presented the design of data warehouses for e-commerce 
environments. We discussed requirements analysis, logical design, and aggrega- 
tion for e-commerce. We have presented an extensive set of interesting OLAP 
queries (see [18] for even more), data warehouse bus architecture, dimension table 
structures, a base star schema, and an aggregation star schema for e-commerce 
environments. To our knowledge, this is the first published detailed dimensional 
model specifically targeted to e-commerce. We do not claim that our model can
be used universally for all e-commerce businesses. Our dimensional model can be
refined to focus on each specific subject area. Because our requirements
analysis focused on OLAP queries, and not every OLAP query can be identified at
the outset, the dimensional model will have to be modified and refined over
time. Nevertheless, our collection of OLAP queries and dimensional models is
very useful in developing real-world data warehouses in e-commerce environments. 

Some issues unique to e-commerce include capturing the navigation habits of
customers, customizing the Web site design or Web pages, and contrasting the
e-commerce side of the business against catalog sales or actual store sales.
Another difficult issue in designing a data warehouse for e-commerce
environments is when and how the data is captured. More research will need to be
devoted to the best means of capturing those data elements. Some additional
questions may need to be answered as well, such as where the clicks (hyperlink
selections) should be tracked. Should that be a separate star schema, as Kimball
has proposed, or should it be incorporated within the most powerful and useful
of the star schemas, such as the sales star schema? Real-world data warehouses
for e-commerce should clarify these issues. 